Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first, and a crucial, step in building a predictive model in machine learning.
In this tutorial, we will understand preprocessing in data science through an example. For the example, I have taken a small dataset, Data.csv, in which the first three columns are the features and the last column is the target.
Understanding the tutorial step by step:
Remember, to help you understand the Python code, I will use comments. In Python, the # symbol starts a comment.
1. First Step: Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
2. Second Step: Importing the dataset
The data frame is assigned to the variable dataset. X stores the first three columns (the independent variables) and y stores the last column. In most projects, the last column is treated as the dependent variable.
dataset = pd.read_csv('Data.csv')
# In the brackets [ ], the value before the comma selects rows and the value after the comma selects columns.
X = dataset.iloc[:, :-1].values # All rows, every column except the last.
y = dataset.iloc[:, -1].values # All rows, only the last column.
print(X)
print(y)
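To see what the two iloc slices return without needing Data.csv, here is a minimal sketch on a tiny hand-made data frame. The column names and values are made up for illustration and are not the contents of the tutorial's dataset.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv: three feature columns and one target column.
demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X_demo = demo.iloc[:, :-1].values  # all rows, every column except the last
y_demo = demo.iloc[:, -1].values   # all rows, only the last column

print(X_demo.shape)  # (3, 3)
print(y_demo.shape)  # (3,)
```

X_demo is a two-dimensional array of features, while y_demo is a one-dimensional array with one label per row.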
3. Third Step: Taking care of missing data
As we have seen, some values in the dataset are missing, and they need to be taken care of before further processing. One common approach is to replace each missing value with the mean of its column.
from sklearn.impute import SimpleImputer
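imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3]) # Learn the mean of each numeric column (columns 1 and 2).
X[:, 1:3] = imputer.transform(X[:, 1:3]) # Replace each missing value with the mean of its column.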
print(X)
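To make the mean replacement concrete, here is a small sketch on a single made-up column (the values are hypothetical). The missing entry should be filled with the mean of the two known values, (20 + 40) / 2 = 30.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing value.
ages = np.array([[20.0], [np.nan], [40.0]])

demo_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = demo_imputer.fit_transform(ages)  # fit and transform in one call

print(demo_imputer.statistics_)  # [30.]  (the learned column mean)
print(filled.ravel())            # [20. 30. 40.]
```

The learned means are stored in statistics_, so the same imputer can later fill missing values in new data with the same numbers.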
4. Fourth Step: Encoding categorical data
Encoding the Independent Variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)
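To see what one-hot encoding does to the first column, here is a minimal sketch on a made-up array with one categorical column and one numeric column. The country names and numbers are assumptions for illustration only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: column 0 is categorical, column 1 is numeric.
X_demo = np.array(
    [["France", 44.0], ["Spain", 27.0], ["Germany", 30.0], ["Spain", 38.0]],
    dtype=object,
)

demo_ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",  # keep the other columns unchanged
)
X_encoded = np.array(demo_ct.fit_transform(X_demo))

# The categories are sorted alphabetically; each one becomes its own 0/1 column.
categories = list(demo_ct.named_transformers_["encoder"].categories_[0])
print(categories)       # ['France', 'Germany', 'Spain']
print(X_encoded.shape)  # (4, 4)
```

The three new 0/1 columns come first and the untouched numeric column comes last, which is why the column indexes of the numeric features shift after this step.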
Encoding the Dependent Variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
print(y)
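As a quick illustration of label encoding, the sketch below encodes a made-up list of "Yes"/"No" labels (hypothetical values). LabelEncoder sorts the distinct labels and replaces each with its position in that sorted list.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels.
labels = ["No", "Yes", "No", "Yes", "Yes"]

demo_le = LabelEncoder()
encoded = demo_le.fit_transform(labels)

print(list(demo_le.classes_))  # ['No', 'Yes']
print(encoded.tolist())        # [0, 1, 0, 1, 1]

# inverse_transform maps the integers back to the original labels.
decoded = demo_le.inverse_transform(encoded)
```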
5. Fifth Step: Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train)
print(X_test)
print(y_train)
print(y_test)
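To check what test_size=0.2 means in practice, here is a small sketch on ten made-up rows (the arrays are hypothetical). Two rows should go to the test set and eight to the training set, and every feature row should stay paired with its own label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features; row i is [2*i, 2*i + 1] and its label is i.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

print(Xd_train.shape, Xd_test.shape)  # (8, 2) (2, 2)

# Rows and labels are shuffled together, so the pairing is preserved.
print(np.array_equal(Xd_train[:, 0], 2 * yd_train))  # True
```

Fixing random_state makes the shuffle reproducible, so the same rows land in the test set on every run.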
6. Sixth Step: Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
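# Columns 3 onward are the numeric features; the columns before them are the one-hot encoded columns and are left unscaled.
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:]) # Learn the mean and standard deviation from the training set and scale it.
X_test[:, 3:] = sc.transform(X_test[:, 3:]) # Scale the test set with the training set's mean and standard deviation.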
print(X_train)
print(X_test)
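StandardScaler standardizes each column as z = (x - mean) / standard deviation. Here is a minimal sketch on made-up numbers (hypothetical values) showing that the scaler is fitted on the training data only and that the test data is scaled with those same training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature training and test data.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

demo_sc = StandardScaler()
train_scaled = demo_sc.fit_transform(train)  # learns mean and std from train
test_scaled = demo_sc.transform(test)        # reuses the training mean and std

print(demo_sc.mean_)  # [20.]

# After scaling, the training column has mean 0 and standard deviation 1
# (up to floating-point rounding).
print(np.isclose(train_scaled.mean(), 0.0), np.isclose(train_scaled.std(), 1.0))

# The test value is scaled with the training statistics: (40 - 20) / std_train.
expected = (40.0 - 20.0) / train.std()
print(np.isclose(test_scaled[0, 0], expected))  # True
```

Fitting on the training set only keeps information from the test set out of the scaling step.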