The goal of this project was to develop a procedure and event that would automatically perform a couple different data cleaning tasks, including removing duplicates, fixing typos, and general standardization.
I started with this US Household Income CSV with 32,533 rows of data. The table was imported into MySQL. Here is an excerpt of the CSV.
It was discovered that there were a couple of duplicate rows in the data, as well as a couple of misspellings and case inconsistencies.
The following code was used to find the duplicates and the result follows.
A quick view of the state names shows what must be a mistake of the name Georgia with only 1 instance between the correctly-spelled Georgia and Hawaii.
Considering a couple of other standardization issues, the following code was included in the procedure.
Before any cleaning actually takes place, the data will be copied to a new table, preserving the original imported data. A timestamp is added to the new table for use in an event that runs the procedure.
So the final procedure looks like this.
The procedure is now fully put together. It's time to double check that it works as intended. The code for checking for duplicates was copied outside the procedure and ran. The result did not find any duplicates.
A double-check of the state names appears to have cleaned up the improperly-spelled Georgia entry.
So it appears that our code excerpts have worked as intended and the procedure is ready to run. The last step is to create the event that can run the procedure, and decide how often it should run.
The procedure could also run as a trigger, as opposed to an event.
So that's it for this project. We created an event that runs a procedure that automatically deletes duplicate rows and carries out some other standardization tasks. Thanks for your attention, I hope you enjoyed this project.