From this article onwards we are going to build machine learning models. But before that, we will go through some basic concepts of machine learning. In this article we are going to see how data preprocessing is done for a machine learning model. So, let's get started.
You can create any dataset for your practice. For this tutorial series we are going to create a dataset which will determine whether a candidate will get hired or not. This dataset will contain the following columns:
- Role : Designation of a candidate to be hired
- Age : Age of the candidate
- Experience : Work experience in years
- Hire : This column will have only two values, either yes or no
Create a CSV file using the above data.
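If you want a concrete starting point, a short script like the one below can generate a hypothetical Data.csv. The names and values here are invented for illustration, so feel free to substitute your own:

```python
import pandas as pd

# Hypothetical sample data for the tutorial; your own values can differ.
sample = pd.DataFrame({
    'Role': ['Developer', 'Tester', 'Manager', 'UI Designer', 'Developer',
             'Tester', 'Developer', 'Manager', 'UI Designer', 'Developer'],
    'Age': [25, 30, None, 28, 35, 26, 27, 40, 29, 31],          # one missing value
    'Experience': [2, 5, 12, 4, None, 3, 6, 15, 5, 7],           # one missing value
    'Hire': ['Yes', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes'],
})
sample.to_csv('Data.csv', index=False)
print(sample.shape)  # (10, 4)
```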
We are going to use the following three essential libraries for data preprocessing:
- numpy : a library which supports multidimensional arrays in Python
- matplotlib : a library used for plotting charts in Python
- pandas : a library which provides expressive data structures for relational data
Using these libraries we are going to read our dataset for data preprocessing. Below is the code snippet to read the dataset using these Python libraries. Please keep Data.csv and data_preprocessing.py in the same directory.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Data.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:,3].values
To read the dataset we have imported the three libraries explained above. To read the data from the Data.csv file, we have created dataset as a pandas object; read_csv is the function in the pandas library which reads a CSV file.
From the dataset we are going to create two matrices (NumPy arrays). The first matrix, x, will contain the first three columns, and the last column, i.e. the Hire column, will be stored in the matrix y.
To separate the dataset into two matrices we have used the iloc indexer from the pandas library. We created two separate matrices because the value of the last column depends on the values of the first three columns. This implies that x is the independent variable whereas y is the dependent variable, whose values depend on x.
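To make the slicing concrete, here is a small self-contained sketch (the sample values are invented) showing how `iloc[:, :-1]` keeps every column except the last while `iloc[:, 3]` picks only the Hire column:

```python
import pandas as pd

# Toy dataset with the same four-column layout (values are illustrative).
df = pd.DataFrame({
    'Role': ['Developer', 'Tester'],
    'Age': [25, 30],
    'Experience': [2, 5],
    'Hire': ['Yes', 'No'],
})

x = df.iloc[:, :-1].values  # all rows, every column except the last
y = df.iloc[:, 3].values    # all rows, only the fourth column (index 3)

print(x.shape)  # (2, 3): two rows, three independent columns
print(y)        # ['Yes' 'No']
```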
Run Data Preprocessing Program:
Select all lines and press Ctrl+Enter to run the program; the variable values can then be seen in the Spyder variable explorer. In the console, type x to see the actual values populated in x. We can do this for any variable.
Here we can see the x matrix consists of three columns, i.e. Role, Age and Experience.
Handle Missing Data In Python
In our dataset you can see there are two missing values, in the Age and Experience columns. To handle these missing values we use the scikit-learn (sklearn) library. How does this library handle missing data? With the mean strategy, the mean of all the other values present in that column is calculated. For example, in our case the missing value in the Age column will be filled with the average of the other age values present in the same column. Below is the code snippet for the same.
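The mean strategy is plain arithmetic: the missing entry is replaced by the average of the known values in the same column. A quick sketch with invented ages:

```python
import numpy as np

# Hypothetical Age column with one missing value.
ages = np.array([25.0, 30.0, np.nan, 28.0, 35.0])

# Mean of the known values only (np.nanmean ignores NaN).
fill = np.nanmean(ages)      # (25 + 30 + 28 + 35) / 4 = 29.5
ages[np.isnan(ages)] = fill  # replace the missing entry

print(fill)  # 29.5
```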
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
In the above code snippet you can see we used the sklearn.impute module. SimpleImputer is a class in scikit-learn which handles missing values in a dataset. (In scikit-learn versions before 0.22 the same job was done by the Imputer class from sklearn.preprocessing, which has since been removed.)
We have created an object of the SimpleImputer class as imputer. This class has several parameters to control how missing values are handled; here we have used only two of them. To know more about the SimpleImputer class, select the word SimpleImputer in your code snippet and press Ctrl+I. This will open the help page in Spyder where you can read about the class we have used.
Note that columns which contained only missing values at fit are discarded upon transform. Select and run the program, then check the updated values of x in the variable explorer.
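In scikit-learn 0.22 and later this kind of mean imputation is provided by the SimpleImputer class from sklearn.impute; a tiny self-contained run on invented numbers shows what the fit step learns:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy Age/Experience columns with one missing value each (invented data).
cols = np.array([
    [25.0, 2.0],
    [np.nan, 5.0],
    [40.0, np.nan],
    [31.0, 7.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(cols)

# fit stores the per-column means, which transform uses as fill values:
# column 0 mean = (25 + 40 + 31) / 3 = 32.0, column 1 mean = (2 + 5 + 7) / 3
print(imputer.statistics_)
print(filled[1, 0])  # 32.0
```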
Now, in the next section we will see how to encode categorical data.
Encode Categorical Data
Categorical data is data which has categories; in our dataset the columns Role and Hire have categories like Developer and Tester, and yes or no, respectively. As machine learning models are based on mathematical equations, it is difficult for a machine to work with text values. Hence, we need to encode the categorical data in our dataset.
For this, the same library we used for missing values, i.e. sklearn, will be used. Following is the code snippet.
#Encoding Categorical Data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelEncoder_X = LabelEncoder()
x[:, 0] = labelEncoder_X.fit_transform(x[:, 0])
columnTransformer = ColumnTransformer([('role', OneHotEncoder(), [0])], remainder='passthrough')
x = columnTransformer.fit_transform(x)
labelEncoder_Y = LabelEncoder()
y = labelEncoder_Y.fit_transform(y)
We have used the following classes to encode categorical data:
- LabelEncoder : Encodes labels with values between 0 and n_classes - 1.
- OneHotEncoder : Encodes categorical integer features using a one-hot (aka one-of-K) scheme.
- ColumnTransformer : Applies a transformer (here the OneHotEncoder) to the selected columns, in our case column 0, and passes the remaining columns through unchanged. (In scikit-learn versions before 0.20 the column was selected with OneHotEncoder's categorical_features argument, which has since been removed.)
Run the program and type x to see the values in x and y to see the values in y.
Here we can see that the values from the column Role have been changed into numbers. These are the encoded numbers for the text values in the same column. For example, 0 indicates the role Developer, 1 Manager, 2 Tester and 3 UI Designer. The same is done for the variable y.
Out: array([1, 1, 0, 1, 1, 0, 1, 0, 0, 1], dtype=int64)
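The numbering is not arbitrary: LabelEncoder assigns codes in sorted order of the class names, which is why Developer gets 0 and UI Designer gets 3. A quick check with invented role values:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical Role values; the classes come out in sorted order.
roles = ['Developer', 'Tester', 'UI Designer', 'Manager', 'Developer']
enc = LabelEncoder()
codes = enc.fit_transform(roles)

print(enc.classes_.tolist())  # ['Developer', 'Manager', 'Tester', 'UI Designer']
print(codes.tolist())         # [0, 2, 3, 1, 0]
```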
That is it! We have encoded the categorical data. Now we will split the data into train and test sets.
Split Dataset Into Train and Test
We will split the dataset into train and test sets for the machine learning models we are going to build later in this series. We split the data into two parts, train and test, so that the test data can be used to evaluate the machine learning model while the train data is used to train it.
To split the dataset we have again used the same library, i.e. sklearn. We have used the train_test_split function, which splits a given percentage of the data into train and test sets. A common practice is to hold out 20% of your dataset as test data. The following code snippet will do the same.
#Splitting Dataset into Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=0)
This will split the dataset into an 80% train set and a 20% test set, i.e. the test set will have 2 records from our 10-record dataset. With that, we have finished data preprocessing for the upcoming machine learning article series. Happy learning 🙂
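To see the 80/20 split in action, here is a minimal sketch on ten invented records; with test_size = 0.2, ten rows yield two test records (the import path shown, sklearn.model_selection, is the one used in scikit-learn 0.20 and later):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical rows of features and labels.
x = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 1])

X_train, X_test, Y_train, Y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

print(len(X_train))  # 8 training records
print(len(X_test))   # 2 test records
```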