The goal of this project was to take a dirty dataset and clean and prepare it for analysis. In this documentation you will find some of the steps taken in this process as well as the SQL code that was used to clean and prepare the data. In some instances, the table used is the "backup" table, since the new table was already cleaned.
The CSV with the data was imported into MySQL. The CSV had 2942 rows and looked something like this.
The first step of the process was to look for duplicates. To do this, the country was concatenated with the year to see if any existed. 3 duplicates were found in this process.
Next, a window function was used with row number to find the row IDs of the duplicate rows.
The duplicate rows were then deleted with the following code.
The next step was to do some cleaning that was clearly necessary simply by looking at the CSV. The status of some countries was blank, so this was a clear next step to clean. The following code was used to find the countries with a blank status.
To find out what the options for status could be, the following code was ran, which output the possibilities "developing" or "developed".
Considering that each country had statuses in other rows, their status could be gathered from the row of the same country from a different year. Using a self-join, the following output could be achieved.
Next, the following bit of code would update the status blanks with developing from a non-blank year that was also developing. The same code could be ran with the blanks that were developed instead.
This result could then be checked, as well as check for any additional nulls with the following code.
Next, it's time to look at the blanks in the Life Expectancy column.
There is a missing life expectancy value in 2018, but the values seem to slowly trend upwards. So, it could be possible to use an average of the next/previous values to fill in the blank data.
This code will require a self-join, in which the rows below and above the missing values will be used to calculate an average. The following code can be used to do so.
The last step was to update the table using this code.
And that concludes the cleaning and preparation for this dataset. Thanks for your attention, I hope you enjoyed it.