One of the most difficult tasks for a data scientist is cleaning the data before diving in to find useful insights. Data cleaning is a crucial procedure that should not be skipped: if the data is not adequately cleaned, your model's accuracy is questionable.
Poor data quality produces biased results with low accuracy and high error rates, so it is crucial to clean the data properly before fitting a model to it. As a data scientist, it is important to understand that not all the data provided to us is useful, and we must know how to treat it.
1. What is Data Cleaning?
Data cleaning refers to the process of identifying and correcting or removing inaccurate, incomplete, or irrelevant data from a dataset. The goal of data cleaning is to ensure that the data is accurate, complete, and consistent so that it can be used effectively for analysis, modeling, or visualization. This involves identifying missing values, removing duplicates, checking for inconsistent or incorrect values, removing outliers, standardizing the data, checking for data integrity, and validating the data. Data cleaning is an essential step in the data preparation process before using the data for any analysis or modeling.
2. Steps Involved in the Data Cleaning Process With Examples
2.1 Identify and Handle Missing Values
In this step, we will identify any missing values in the dataset and handle them by either removing rows with missing data or imputing the missing values using methods like mean, median, or regression imputation.
Data:
# Import necessary module
import pandas as pd
import numpy as np
# Create a sample dataset
data = {'Name': ['John', 'Jane', 'Mike', 'David', 'Sarah', np.nan, 'Emily', 'Tom', 'Jessica', 'John'],
        'Age': [25, 35, 42, np.nan, 28, 31, 39, 47, np.nan, 25],
        'Gender': ['Male', 'Unknown', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Female', 'Male'],
        'Salary': [50000, 75000, np.nan, 60000, 65000, 80000, 900000, np.nan, 155000, 50000]}
# Create pandas DataFrame
df = pd.DataFrame(data)
Let's remove the rows with missing data.
# Check for missing values
df.isnull().sum()
# Handle missing values by dropping rows with missing data
df.dropna(inplace=True)
Yields the below output.
# Output:
Name 1
Age 2
Gender 0
Salary 2
dtype: int64
Name Age Gender Salary
0 John 25.0 Male 50000.0
1 Jane 35.0 Unknown 75000.0
2 Mike 42.0 Male NaN
3 David NaN Male 60000.0
4 Sarah 28.0 Female 65000.0
5 NaN 31.0 Female 80000.0
6 Emily 39.0 Female 900000.0
7 Tom 47.0 Male NaN
8 Jessica NaN Female 155000.0
9 John 25.0 Male 50000.0
Name Age Gender Salary
0 John 25.0 Male 50000.0
1 Jane 35.0 Unknown 75000.0
4 Sarah 28.0 Female 65000.0
6 Emily 39.0 Female 900000.0
9 John 25.0 Male 50000.0
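Dropping rows is only one option; as mentioned above, missing values can also be imputed. Below is a minimal sketch of mean and median imputation on the same sample data, rebuilt here so the snippet runs on its own:

```python
import pandas as pd
import numpy as np

# Rebuild the numeric part of the sample data so this snippet is self-contained
df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Mike', 'David', 'Sarah', np.nan, 'Emily', 'Tom', 'Jessica', 'John'],
    'Age': [25, 35, 42, np.nan, 28, 31, 39, 47, np.nan, 25],
    'Salary': [50000, 75000, np.nan, 60000, 65000, 80000, 900000, np.nan, 155000, 50000],
})

# Impute instead of dropping: mean for Age, median for Salary
# (the median is less sensitive to the extreme 900000 salary)
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

print(df[['Age', 'Salary']].isnull().sum().sum())  # 0 missing values remain
```

Which strategy to use depends on the data: dropping rows loses information, while imputation keeps every row at the cost of introducing estimated values.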
2.2 Remove Duplicates
In this step, we will identify and remove duplicates from the dataset.
CODE
# Check for duplicates
df.duplicated().sum()
# Drop duplicates
df.drop_duplicates(inplace=True)
OUTPUT
# Output:
1
Name Age Gender Salary
0 John 25.0 Male 50000.0
1 Jane 35.0 Unknown 75000.0
4 Sarah 28.0 Female 65000.0
6 Emily 39.0 Female 900000.0
9 John 25.0 Male 50000.0
Name Age Gender Salary
0 John 25.0 Male 50000.0
1 Jane 35.0 Unknown 75000.0
4 Sarah 28.0 Female 65000.0
6 Emily 39.0 Female 900000.0
2.3 Check for inconsistent or incorrect values
In this step, we will check for any inconsistent or incorrect values in the data and correct them.
CODE
# Check for inconsistent or incorrect values
df['Gender'].unique()

# Replace inconsistent or incorrect values with NaN
# (assigning back avoids chained-assignment warnings)
df['Gender'] = df['Gender'].replace('Unknown', np.nan)

# Check for missing values
df.isnull().sum()
OUTPUT
# Output:
array(['Male', 'Female', 'Unknown'], dtype=object)
Name 0
Age 0
Gender 1
Salary 0
dtype: int64
Name Age Gender Salary
0 John 25.0 Male 50000.0
1 Jane 35.0 NaN 75000.0
4 Sarah 28.0 Female 65000.0
6 Emily 39.0 Female 900000.0
2.4 Remove Outliers
In this step, we will identify and remove any outliers that can skew the results.
CODE
# Identify outliers using the z-score
from scipy.stats import zscore

df['z_score'] = zscore(df['Salary'])
# Keep only rows within 3 standard deviations of the mean.
# Note: with only four rows, no value can exceed |z| = 3,
# so no rows are dropped from this tiny sample.
df = df.loc[df['z_score'].abs() < 3]

# Remove the helper z-score column
df.drop('z_score', axis=1, inplace=True)
OUTPUT
# Output:
Name Age Gender Salary
0 John 25.0 Male 50000.0
1 Jane 35.0 NaN 75000.0
4 Sarah 28.0 Female 65000.0
6 Emily 39.0 Female 900000.0
Name Age Gender Salary
0 John 25.0 Male 50000.0
1 Jane 35.0 NaN 75000.0
4 Sarah 28.0 Female 65000.0
6 Emily 39.0 Female 900000.0
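The z-score rule works best when the data is roughly normal; the interquartile-range (IQR) rule is a common alternative that makes no such assumption. A sketch on an illustrative salary series (the values mirror the tutorial's sample, but the variable names here are hypothetical):

```python
import pandas as pd

# Illustrative salary series used only to demonstrate the IQR rule
salaries = pd.Series([50000, 75000, 65000, 900000])

# Compute the quartiles and the 1.5 * IQR fences
q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only values inside the fences; 900000 falls outside and is removed
filtered = salaries[(salaries >= lower) & (salaries <= upper)]
print(filtered.tolist())  # [50000, 75000, 65000]
```

Unlike the z-score cutoff, the IQR fences flag the extreme salary even in this tiny sample, because quartiles are not dragged toward the outlier the way the mean and standard deviation are.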
2.5 Standardize the Data
In this step, we will standardize the data by ensuring that all data are in the same format and using the same units of measurement.
CODE
# Standardize the data
df['Gender'] = df['Gender'].replace({'Male': 'M', 'Female': 'F'})
df['Salary'] = df['Salary'] / 1000
OUTPUT
# Output:
Name Age Gender Salary
0 John 25.0 M 50.0
1 Jane 35.0 NaN 75.0
4 Sarah 28.0 F 65.0
6 Emily 39.0 F 900.0
2.6 Check for Data Integrity
In this step, we will check for any data integrity issues such as missing values, duplicates, or incorrect data types.
CODE
# Check for missing values
df.isnull().sum()

# Check for duplicates
df.duplicated().sum()

# Drop duplicates
df.drop_duplicates(inplace=True)

# Fill missing numeric values with the column mean and the
# missing Gender value with the most frequent category
df.fillna(df.mean(numeric_only=True), inplace=True)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

# Check the data types
df.dtypes
OUTPUT
# Output:
Name 0
Age 0
Gender 1
Salary 0
dtype: int64
0
Name Age Gender Salary
0 John 25.0 M 50.0
1 Jane 35.0 F 75.0
4 Sarah 28.0 F 65.0
6 Emily 39.0 F 900.0
Name object
Age float64
Gender object
Salary float64
dtype: object
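The step above also mentions incorrect data types, which the tutorial's sample data happens not to have. A minimal sketch of fixing them with `astype` and `pd.to_numeric`, on a hypothetical frame where numbers arrived as strings:

```python
import pandas as pd

# Hypothetical frame with numbers stored as strings (a common CSV artifact)
df = pd.DataFrame({'Age': ['25', '35', '28'],
                   'Salary': ['50.0', '75.0', 'n/a']})

# astype works when every value parses cleanly
df['Age'] = df['Age'].astype(float)

# pd.to_numeric with errors='coerce' turns unparseable values into NaN
# instead of raising, so the rest of the column still converts
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')

print(df.dtypes)
```

After conversion both columns are `float64`, so numeric operations like means and z-scores behave as expected.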
2.7 Transform the Data
In this step, we will transform the data as required. This may involve feature engineering, scaling the data, or creating new features.
CODE
# Normalize the data
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])
OUTPUT
# Output:
Name Age Gender Salary
0 John 0.000000 M 0.000000
1 Jane 0.714286 F 0.029412
4 Sarah 0.214286 F 0.017647
6 Emily 1.000000 F 1.000000
2.8 Export the Cleaned Data
In this step, we will export the cleaned data to a new file or overwrite the existing file with the cleaned data.
CODE
# Export the cleaned data
df.to_csv('cleaned_data.csv', index=False)
OUTPUT
# Output:
Name Age Gender Salary
0 John 0.000000 M 0.000000
1 Jane 0.714286 F 0.029412
4 Sarah 0.214286 F 0.017647
6 Emily 1.000000 F 1.000000
3. Advantages of Data Cleaning
Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and correcting errors, inconsistencies, and inaccuracies in data. The main advantage of data cleaning is that it improves the quality of data, making it more accurate, complete, and consistent. Here are some specific advantages of data cleaning:
Better decision-making: High-quality data is essential for making informed decisions. Data cleaning ensures that the data used for analysis and decision-making is accurate and reliable.
Increased efficiency: Data cleaning saves time by reducing the need to manually correct errors and inconsistencies. This enables analysts and data scientists to focus on analyzing the data rather than cleaning it.
Improved data quality: By identifying and correcting errors, data cleaning improves the overall quality of the data. This, in turn, leads to more accurate results and better insights.
Greater data consistency: Consistent data is essential for comparing and analyzing data across different sources. Data cleaning ensures that data is consistent across different sources, improving its usability.
Better data integration: Data cleaning can help to integrate data from different sources by ensuring that the data is compatible and consistent.
4. Disadvantages of Data Cleaning
While data cleaning provides significant advantages, there are also some disadvantages to consider. Here are a few:
Time-consuming: Data cleaning can be a time-consuming process, especially when dealing with large datasets. It requires careful analysis and attention to detail to identify errors and inconsistencies, and correcting them can be a tedious process.
Costly: Depending on the size and complexity of the data, data cleaning can be costly, requiring dedicated resources and specialized software tools. This can be a significant expense for businesses, particularly smaller ones.
Risk of data loss: In some cases, data cleaning can result in the loss of valuable information. If errors or inconsistencies are incorrectly identified and corrected, it can result in the loss of important data.
Difficulty in identifying errors: Some errors in data may be difficult to identify, particularly if they are subtle or hidden within the data. This can result in incomplete or inaccurate data, which can negatively impact decision-making.
Potential bias: Data cleaning may inadvertently introduce bias into the data, particularly if certain data points are removed or altered in the process. This can lead to inaccurate results and skewed analysis.
5. Conclusion
Data cleaning is an important step in the data analysis process. It ensures that the data is accurate, complete, and consistent, which is essential for making informed decisions. In this tutorial, we have discussed the steps involved in data cleaning and provided code examples to demonstrate each step. By following these steps, you can ensure that your data is clean and ready for analysis.