Data Processing in ML

Data preprocessing covers the steps required to transform or encode raw data so that a machine learning algorithm can parse it easily.

For a model to make accurate and precise predictions, the algorithm must be able to interpret the data’s features without ambiguity.

1. What is Data Preprocessing?

Data preprocessing is a critical step in machine learning that involves preparing raw data for analysis by cleaning, transforming, and integrating it into a usable format. The main objective of data preprocessing is to improve the quality of the data and eliminate any inconsistencies or biases that may affect the accuracy and effectiveness of the machine learning model. The following are the key steps involved in data preprocessing:

Data Preprocessing Steps

1.1 Data Collection

The first step in data processing is collecting data from various sources. This can be done manually or automatically using software tools.

Here’s an example of collecting weather data from an API with Python and storing it in a CSV file.

import requests
import csv

# Set up the API request (replace YOUR_API_KEY with a valid key)
url = 'https://api.openweathermap.org/data/2.5/weather?q=London,uk&appid=YOUR_API_KEY&units=metric'
response = requests.get(url)

# Parse the response JSON
data = response.json()

# Extract the relevant weather fields
weather_data = {
    'location': data['name'],
    'temperature': data['main']['temp'],
    'humidity': data['main']['humidity'],
    'wind_speed': data['wind']['speed'],
    'description': data['weather'][0]['description']
}

# Save the data to a CSV file
with open('weather_data.csv', mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Location', 'Temperature', 'Humidity', 'Wind Speed', 'Description'])
    writer.writerow([weather_data['location'], weather_data['temperature'], weather_data['humidity'], weather_data['wind_speed'], weather_data['description']])

# Output
Location,Temperature,Humidity,Wind Speed,Description
London,12.23,62,6.69,scattered clouds

As you can see, the code sends a request to the OpenWeatherMap API to retrieve weather data for London, UK. The response is in JSON format, which is then parsed to extract the relevant weather data. The data is then saved to a CSV file with headers and a single row of weather data. This CSV file can then be used for further analysis or combined with other datasets.
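In practice, it’s also worth guarding against failed requests before parsing the body. Here is a minimal sketch using the requests API:

# Raise an exception for HTTP error status codes before parsing
response = requests.get(url, timeout=10)
response.raise_for_status()
data = response.json()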

1.2 Data Preparation

Once the data has been collected, it needs to be cleaned and pre-processed. This involves removing duplicates, filling in missing values, and correcting errors.

Here’s an example of data preparation using Python code.

Data Set Link: https://github.com/Narenderbeniwal/Spark-By-Example

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load the dataset into a Pandas DataFrame
data = pd.read_csv("iris.csv")

# Drop the ID column
data = data.drop(columns=["Id"])

# Check for missing values
print("Missing values:")
print(data.isnull().sum())

# Scale the numerical features
scaler = StandardScaler()
numerical_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
data[numerical_cols] = scaler.fit_transform(data[numerical_cols])

# Convert the categorical feature to numerical using one-hot encoding
data = pd.get_dummies(data, columns=["Species"])

# Split the dataset into training and testing sets
species_cols = ["Species_Iris-setosa", "Species_Iris-versicolor", "Species_Iris-virginica"]
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=species_cols),
    data[species_cols],
    test_size=0.2, random_state=42)

# Save the preprocessed data to a new CSV file
data.to_csv("preprocessed_iris.csv", index=False)

# Print the first 5 rows of the preprocessed data
print(data.head())

# Output
Missing values:
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
0 -0.900681 1.032057 -1.341272 -1.312977
1 -1.143017 -0.124958 -1.341272 -1.312977
2 -1.385353 0.337848 -1.398138 -1.312977
3 -1.506521 0.106445 -1.284407 -1.312977
4 -1.021849 1.263460 -1.341272 -1.312977

Species_Iris-setosa Species_Iris-versicolor Species_Iris-virginica
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0

In this example, we are preparing the famous iris dataset. We start by loading the data into a Pandas DataFrame and dropping the Id column, which is irrelevant to our analysis.

Next, we check for missing values using the isnull() and sum() methods. In this case, we don’t find any missing values, so we can move on to scaling the numerical features using StandardScaler from the sklearn.preprocessing module.

We then convert the categorical feature to numerical using one-hot encoding with the get_dummies method of Pandas DataFrame.

We then split the dataset into training and testing sets using train_test_split from the sklearn.model_selection module.

Finally, we save the preprocessed data to a new CSV file using the to_csv method and print the first 5 rows of the preprocessed data using the head() method.

Note that the specifics of data preparation will depend on the specific dataset and analysis you are performing and may require different techniques or tools.
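When a dataset does contain duplicates or missing values, the cleaning steps mentioned at the start of this section can be sketched as follows. This is a minimal example assuming a hypothetical file raw_data.csv with a numeric age column and a categorical city column:

import pandas as pd

# Load a hypothetical raw dataset (file and column names are assumptions)
df = pd.read_csv("raw_data.csv")

# Remove exact duplicate rows
df = df.drop_duplicates()

# Fill missing numeric values with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Fill missing categorical values with the most frequent value
df["city"] = df["city"].fillna(df["city"].mode()[0])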

1.3 Data Integration

This involves combining data from multiple sources to create a comprehensive dataset that can be used to train the machine learning model.

Here’s an example of data integration using Python code.

import pandas as pd

# Load the first CSV file into a DataFrame
df1 = pd.read_csv('file1.csv')

# Load the second CSV file into a DataFrame
df2 = pd.read_csv('file2.csv')

# Merge the two DataFrames on a common column
merged_df = pd.merge(df1, df2, on='common_column')

# Save the merged DataFrame to a new CSV file
merged_df.to_csv('merged_file.csv', index=False)

In this example, we are using the Pandas library to read in two CSV files, merge them on a common column, and then save the merged data to a new CSV file. The pd.read_csv() function is used to load the data from each file into a Pandas DataFrame, and the pd.merge() function is used to merge the DataFrames based on a common column. Finally, the to_csv() function is used to save the merged DataFrame to a new CSV file.

Note that the on parameter in the pd.merge() function specifies the name of the common column to merge the DataFrames on. If the common column has different names in the two DataFrames, you can specify them separately using the left_on and right_on parameters.
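For example, here is a small, self-contained sketch of merging on differently named key columns (the frames and column names are made up for illustration):

import pandas as pd

# Small illustrative DataFrames; column names are assumptions
df1 = pd.DataFrame({'customer_id': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cal']})
df2 = pd.DataFrame({'cust_id': [2, 3, 4], 'order_total': [25.0, 40.0, 15.0]})

# Merge on keys that have different names in each DataFrame
merged_df = pd.merge(df1, df2, left_on='customer_id', right_on='cust_id')
print(merged_df)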

1.4 Data Normalization

Data normalization is a crucial step in data processing for machine learning. It helps to scale the data to a standard range, such as between 0 and 1 or -1 and 1. This ensures that the impact of the magnitude of the variables on the model is minimized. Here’s an example of data normalization using Python code:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the dataset
data = pd.read_csv('dataset.csv')

# Extract the features to be normalized
X = data.iloc[:, :-1].values

# Normalize the features using MinMaxScaler
scaler = MinMaxScaler()
normalized_X = scaler.fit_transform(X)

# Replace the original features with the normalized features
data.iloc[:, :-1] = normalized_X

# Save the processed data
data.to_csv('processed_data.csv', index=False)

In this example, we first load the dataset using the pandas library. Then, we extract the features to be normalized and store them in the X variable. We use the MinMaxScaler class from the sklearn.preprocessing module to normalize the features, which scales the data to a range between 0 and 1. We replace the original features in the dataset with the normalized features and save the processed data to a CSV file.
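Under the hood, MinMaxScaler applies the formula x' = (x - min) / (max - min) to each column. A quick sketch verifying this on a toy array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [9.0]])

# Manual min-max formula: (x - min) / (max - min)
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# MinMaxScaler produces the same result with default settings
scaled = MinMaxScaler().fit_transform(X)
print(np.allclose(manual, scaled))  # True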

1.5 Feature Selection and Extraction

Feature selection and extraction are important steps in data processing for machine learning. They involve identifying the most relevant features that have the most significant impact on the model’s output. Here’s an example of feature selection and extraction using Python code:

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA

# Load the dataset
data = pd.read_csv('dataset.csv')

# Extract the features and target variable
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# Select the top K features using SelectKBest and the chi-squared test
# (note: chi2 requires non-negative feature values)
selector = SelectKBest(chi2, k=5)
selected_X = selector.fit_transform(X, y)

# Extract the top principal components using PCA
pca = PCA(n_components=3)
extracted_X = pca.fit_transform(X)

# Save the processed data (here we keep the PCA components;
# selected_X is computed for comparison and is not written out)
processed_data = pd.DataFrame(extracted_X, columns=['PC1', 'PC2', 'PC3'])
processed_data['target'] = y
processed_data.to_csv('processed_data.csv', index=False)

In this example, we first load the dataset using the pandas library. Then, we extract the features and target variable and store them in the X and y variables, respectively. We use the SelectKBest class from the sklearn.feature_selection module to select the top 5 features using the chi-squared test, which ranks features by their dependence on the target variable (note that chi2 requires non-negative feature values). We then use the PCA class from the sklearn.decomposition module to extract the top 3 principal components, which reduces the dimensionality of the dataset while retaining most of the important information. Finally, we save the PCA-transformed data to a CSV file.
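To see how much information the retained components preserve, you can inspect PCA’s explained_variance_ratio_ attribute. A small self-contained sketch on toy data:

import numpy as np
from sklearn.decomposition import PCA

# Toy data purely to illustrate the attribute (values are arbitrary)
X = np.random.rand(100, 5)

pca = PCA(n_components=3)
pca.fit(X)

# Fraction of total variance captured by each retained component
print(pca.explained_variance_ratio_)
print('Total variance retained:', pca.explained_variance_ratio_.sum())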

1.6 Data Splitting

Data splitting is an important step in data processing for machine learning. It involves dividing the processed data into training and testing datasets. The training data is used to train the model, while the testing data is used to evaluate the performance of the model. Here’s an example of data splitting using Python code:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
data = pd.read_csv('processed_data.csv')

# Extract the features and target variable
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Save the split data
train_data = pd.DataFrame(X_train, columns=['feature1', 'feature2', 'feature3'])
train_data['target'] = y_train
train_data.to_csv('train_data.csv', index=False)

test_data = pd.DataFrame(X_test, columns=['feature1', 'feature2', 'feature3'])
test_data['target'] = y_test
test_data.to_csv('test_data.csv', index=False)

In this example, we first load the processed data using the pandas library. Then, we extract the features and target variable and store them in the X and y variables, respectively. We use the train_test_split function from the sklearn.model_selection module to split the data into training and testing sets. The test_size parameter specifies the fraction of the data reserved for testing (here 20%). The random_state parameter ensures that the data is split the same way every time the code is run. Finally, we save the split data into separate CSV files for the training and testing datasets.
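For classification problems, it is often worth preserving the class balance in both subsets; train_test_split supports this via the stratify parameter. A minimal sketch, reusing X and y from the example above:

# Stratified split keeps the class proportions of y in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)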

2. Why Data Preprocessing?

Data preprocessing helps to improve the quality of the data, reduce the dimensionality of the dataset, and prepare the data for training and testing machine learning models. By processing the data, we can identify and remove errors, standardize the data, and extract the most relevant features for use in building accurate and efficient models.

3. What after Data Preprocessing?

After data preprocessing, we can move on to the next steps in the machine learning pipeline, such as selecting a machine learning algorithm, splitting the data into training and testing datasets, training the model, evaluating its performance, tuning hyperparameters, and deploying the model.
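As a rough sketch of those next steps, one could train and evaluate a simple classifier on the split data from section 1.6 (the choice of logistic regression here is just an illustration):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train a simple baseline model on the training split
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the held-out test split
predictions = model.predict(X_test)
print('Test accuracy:', accuracy_score(y_test, predictions))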

4. Conclusion

Overall, data preprocessing is an essential step in building accurate and effective machine learning models, and it requires careful attention to detail to ensure that the data is processed appropriately and efficiently.

