Analytics Tips #4 - 5 Simple Steps to Automate Data Cleaning with Python
Automate your data cleaning process with a practical 5-step pipeline in Python
Hey everyone! This is Josep, one more week 👋🏻
This week, we have the fourth issue of the Analytics Tips series, where we'll explore different aspects of data science in an easy-to-understand way.
It is widely acknowledged among Data Scientists that data cleaning takes up a big proportion of our working time. However, it is also one of the least exciting parts of the job.
So this leads to a very natural question:
Is there a way to automate this process?
Automating any process is always easier said than done since the steps to perform depend mostly on the specific project and goal.
But there are always ways to automate at least some parts of it.
Data Cleaning Process
Before proceeding to generate the pipeline, we need to understand what parts of the process can be automated.
Since we want to build a process that can be used for almost any data science project, we need to first determine what steps are performed over and over again.
So when working with a new data set, we usually ask the following questions:
What format does the data come in?
Does the data contain duplicates?
Does the data contain missing values?
What data types does the data contain?
Does the data contain outliers?
These 5 questions can easily be converted into 5 blocks of code to deal with each of the questions:
1. Data Format
Data can come in different formats, such as JSON, CSV, or even XML.
Every format requires its own data parser. For instance, pandas provides read_csv for CSV files and read_json for JSON files.
By identifying the format, you can choose the right tool to begin the cleaning process.
We can easily identify the format of the file we are dealing with using the path.splitext function from the os library.
Therefore, we can create a function that first determines what extension we have and then directly applies the corresponding parser.
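A minimal sketch of what such a function could look like is shown below. The helper name load_data and the exact set of supported formats are my own choices for illustration (pd.read_xml assumes a reasonably recent pandas version):

```python
import os
import pandas as pd

def load_data(file_path):
    """Pick the right pandas parser based on the file extension."""
    # os.path.splitext returns ("name", ".ext"); we only need the extension
    _, extension = os.path.splitext(file_path)
    extension = extension.lower()

    if extension == ".csv":
        return pd.read_csv(file_path)
    elif extension == ".json":
        return pd.read_json(file_path)
    elif extension == ".xml":
        return pd.read_xml(file_path)
    else:
        raise ValueError(f"Unsupported file format: {extension}")
```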
2. Duplicates
It happens quite often that some rows of the data contain exactly the same values as other rows, which is what we call duplicates. Duplicated data can skew results and lead to inaccurate analyses, which is not good at all.
This is why we always need to make sure there are no duplicates.
Pandas has us covered with the drop_duplicates() method, which removes all duplicated rows of a dataframe.
We can create a straightforward function that utilizes this method to remove all duplicates. If necessary, we add a columns input variable that adapts the function to eliminate duplicates based on a specific list of column names.
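A simple sketch of this idea could look like the following (the function name remove_duplicates is mine, not part of pandas):

```python
def remove_duplicates(df, columns=None):
    """Drop duplicated rows, optionally based on a subset of columns."""
    # When columns is None, pandas compares entire rows
    return df.drop_duplicates(subset=columns).reset_index(drop=True)
```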
3. Missing Values
Missing values are another common issue when working with data. Depending on the nature of the data, we can simply delete the observations containing missing values, or we can fill the gaps using methods like forward fill, backward fill, or substituting with the mean or median of the column.
Pandas offers us the .fillna() and .dropna() methods to handle these missing values effectively.
The choice of how we handle missing values depends on:
The type of values that are missing
The proportion of missing values relative to the total number of records we have.
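A minimal helper covering both strategies could be sketched like this (the name handle_missing_values and the strategy/fill_value parameters are assumptions of mine, not a fixed API):

```python
def handle_missing_values(df, strategy="drop", fill_value=None):
    """Handle missing values by dropping rows or filling the gaps."""
    if strategy == "drop":
        # Remove any row that contains at least one missing value
        return df.dropna()
    elif strategy == "fill":
        # Fill gaps with a constant, or with each numeric column's median
        if fill_value is not None:
            return df.fillna(fill_value)
        return df.fillna(df.median(numeric_only=True))
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
```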
4. Dealing with Data Types
Ensuring correct data types is crucial for data analysis. Sometimes, data types are incorrectly assigned, such as treating a numeric column as a string.
The main problem in dealing with data types is that there are too many scenarios to consider. This is why the most straightforward way to automate this part of the cleaning is to define the data types we expect our table to have and raise a warning in case there is a mismatch.
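One way this check could look, as a sketch (the function name check_data_types and the dtype-string comparison are my own illustrative choices):

```python
import warnings

def check_data_types(df, expected_types):
    """Warn about columns whose dtype does not match what we expect."""
    for column, expected in expected_types.items():
        actual = str(df[column].dtype)
        if actual != expected:
            warnings.warn(f"Column '{column}' is {actual}, expected {expected}")
    return df

# Example usage with a hypothetical schema
# df = check_data_types(df, {"age": "int64", "name": "object"})
```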
5. Dealing with Outliers
Outliers can significantly affect the results of your data analysis. Techniques to handle outliers include setting thresholds, capping values, or using statistical methods like Z-score.
In order to determine if we have outliers in our dataset, we use a common rule and consider any record outside of the following range as an outlier.
[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
Where IQR stands for the interquartile range and Q1 and Q3 are the 1st and the 3rd quartiles. Below you can observe all the previous concepts displayed in a boxplot.
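Translating that rule into code, a small sketch could look like this (flag_outliers is a hypothetical helper name; it returns a boolean mask rather than dropping rows, so a human can decide what to do with the flagged records):

```python
def flag_outliers(df, column):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Boolean mask marking the rows that fall outside the range
    return (df[column] < lower) | (df[column] > upper)
```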
Final Recommendations
Data cleaning is a crucial part of any data project; however, it is usually the most tedious and time-consuming phase as well. This is why a semi-automated pipeline can speed up this process and allow us to save some time.
This blend of automation with human oversight ensures both efficiency and accuracy, making it a robust solution for data scientists aiming to optimize their workflow.
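As a final sketch, the hypothetical helpers above (load_data, remove_duplicates, handle_missing_values, check_data_types, all names of my own) could be chained into a single entry point, leaving the column-specific outlier check for manual review:

```python
def clean_pipeline(file_path, expected_types=None):
    """Run the cleaning steps sketched above in sequence."""
    df = load_data(file_path)
    df = remove_duplicates(df)
    df = handle_missing_values(df, strategy="drop")
    if expected_types:
        df = check_data_types(df, expected_types)
    return df
```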
To learn more, you can check the following article, and you can find my whole code in the following GitHub repo.
And this is all for now!
If you have any suggestions or preferences, please comment below or message me through my social media!
Remember you can also find me on X, Threads, Medium and LinkedIn 🤓