Explaining Data Preprocessing with Python Code | Data Science


Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and most crucial step when creating a prediction model in machine learning.

In this tutorial, we walk through preprocessing in data science step by step with an example. For this, I have taken a small dataset, which looks as follows:

[dataset preview: three feature columns and one dependent column; some values are missing]
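Since the dataset preview is not reproduced here, below is a hypothetical stand-in built with pandas. The column names and values are my own assumptions, chosen only to match the shape used in the rest of the tutorial: three feature columns, one dependent column, and a couple of missing entries.

```python
import pandas as pd
import numpy as np

# A made-up stand-in for Data.csv: three feature columns and one
# dependent column, with deliberately missing values (np.nan).
toy = pd.DataFrame({
    'Country':   ['France', 'Spain', 'Germany', 'Spain', 'Germany'],
    'Age':       [44.0, 27.0, 30.0, np.nan, 40.0],
    'Salary':    [72000.0, 48000.0, 54000.0, 61000.0, np.nan],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes'],
})
print(toy)
```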
Understanding the tutorial step by step:

To help you follow the Python code, I will add comments; in Python, the # symbol is used to write a comment.

 

1. First Step: Importing the libraries

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

 

2. Second Step: Importing the dataset

The data frame is loaded into the variable dataset. X stores the first three columns (the independent variables) and y stores the last column. In most projects, the last column is treated as the dependent variable.

dataset = pd.read_csv('Data.csv')

X = dataset.iloc[:, :-1].values  # inside the brackets [ ], the value before the comma selects rows, the value after the comma selects columns

y = dataset.iloc[:, -1].values

print(X)

print(y)
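To see the iloc slicing in isolation, here is a minimal sketch on a small made-up frame (the names demo, X_demo, and y_demo are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
X_demo = demo.iloc[:, :-1].values  # all rows, every column except the last
y_demo = demo.iloc[:, -1].values   # all rows, only the last column
print(X_demo.shape)  # (2, 2)
print(y_demo)        # [5 6]
```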

 


3. Third Step: Taking care of missing data

As we have seen, some values in the dataset are missing and need to be handled before further processing. These missing values can be replaced with the mean of their column.

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

imputer.fit(X[:, 1:3])

X[:, 1:3] = imputer.transform(X[:, 1:3])

print(X)
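The same imputation can be checked on a tiny made-up array: each np.nan is filled with the mean of the non-missing values in its column (2.0 and 15.0 here).

```python
import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[1.0,    10.0],
                 [np.nan, 20.0],
                 [3.0,    np.nan]])
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imp.fit_transform(data)  # nan -> column mean of the observed values
print(filled)
```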

 


4. Fourth Step: Encoding categorical data

Encoding the Independent Variable

from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

X = np.array(ct.fit_transform(X))

print(X)
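Here is a self-contained sketch of the same ColumnTransformer pattern on made-up data. Categories are one-hot encoded in sorted order (so 'blue' comes before 'red'), and the numeric column is passed through unchanged.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X_toy = np.array([['red',  1.0],
                  ['blue', 2.0],
                  ['red',  3.0]], dtype=object)
ct_toy = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],  # one-hot encode column 0
    remainder='passthrough')                           # keep other columns as-is
X_toy_enc = np.array(ct_toy.fit_transform(X_toy))
print(X_toy_enc)  # three columns: blue flag, red flag, original numeric value
```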

 


Encoding the Dependent Variable

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

y = le.fit_transform(y)

print(y)
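LabelEncoder maps each class label to an integer, assigned in sorted order of the class names. A quick sketch with made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

labels = ['No', 'Yes', 'No', 'Yes', 'Yes']
le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)
print(encoded)           # [0 1 0 1 1] -- classes sorted, so 'No' -> 0 and 'Yes' -> 1
print(le_demo.classes_)  # ['No' 'Yes']
```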

 


5. Fifth Step: Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

print(X_train)

print(X_test)

print(y_train)

print(y_test)
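With test_size=0.2, 20% of the rows go to the test set and the rest to the training set; random_state fixes the shuffle so the split is reproducible. A sketch on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y_demo = np.arange(10)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo,
                                      test_size=0.2, random_state=1)
print(Xtr.shape, Xte.shape)  # (8, 2) (2, 2)
```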

 


6. Sixth Step: Feature Scaling

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

X_test[:, 3:] = sc.transform(X_test[:, 3:])

print(X_train)

print(X_test)
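Note that fit_transform is used on the training set only, while the test set is transformed with the mean and standard deviation learned from the training set; refitting the scaler on the test set would leak information. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0]])
sc_demo = StandardScaler()
train_s = sc_demo.fit_transform(train)  # learns mean=2.0 from the training set
test_s = sc_demo.transform(test)        # reuses the training statistics, no refitting
print(train_s.ravel())
print(test_s.ravel())  # 2.0 equals the training mean, so it maps to 0.0
```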


