Hello all, welcome to another tutorial in machine learning. So far, we have learned about several regression models, where a set of independent variables is used to predict a dependent variable. For example, we predicted salary based on experience, qualifications, etc. However, in some problems the dependent variable takes only specific, discrete values. For such machine learning models we need a classification mechanism.
From this article onward, we will learn about classification mechanisms and the Python libraries that support them. This article focuses on logistic regression classification.
Logistic Regression Classification:
Logistic Regression classification can take an observation and assign it to the appropriate class. We will use the dataset for logistic regression taken from www.superdatascience.com/machine-learning
Consider a dataset of users of a social networking site with user id, gender, age, and salary. Suppose we are a car manufacturing company that has just launched a great new car, and we want to know which users will buy it.
We have some observations as shown below: each record has the user id, gender, age, and salary of a person, and whether that person purchased the car. The full dataset is available here.
Preparing the data:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
- We read the “Social_Network_Ads.csv” file and stored it in dataset.
- Extracted the age and salary information from dataset and stored it in X.
- Extracted the purchase information from dataset and stored it in y.
- Split the dataset into training and test sets, so that the machine can be trained using X_train and y_train, and y_pred can later be compared with y_test.
- Used feature scaling for X_train and X_test.
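To see what feature scaling actually does, here is a small sketch using StandardScaler on made-up (age, salary) rows, not the tutorial's dataset. Scaling puts both columns on a comparable scale so that the much larger salary values do not dominate the model:

```python
# Illustrative only: the sample array below is made-up data.
import numpy as np
from sklearn.preprocessing import StandardScaler

sample = np.array([[19, 19000.0],
                   [35, 20000.0],
                   [26, 43000.0],
                   [47, 25000.0]])
scaler = StandardScaler()
scaled = scaler.fit_transform(sample)
# After scaling, each column has mean ~0 and standard deviation ~1.
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Note that we call fit_transform() only on the training data and plain transform() on the test data, so the test set is scaled with the training set's mean and standard deviation.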
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
- We imported the LogisticRegression class from the sklearn.linear_model module.
- Created classifier as an object of LogisticRegression.
- Fitted the training data to the classifier.
- Predicted the results for X_test and stored them in y_pred.
We have our predictions ready.
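A quick way to summarize how well the predictions match the test set is a confusion matrix. The sketch below uses small made-up arrays as stand-ins for y_test and y_pred:

```python
# Illustrative only: y_true and y_hat are made-up stand-ins
# for the tutorial's y_test and y_pred.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1, 0])
y_hat  = np.array([0, 1, 1, 1, 0, 0])
cm = confusion_matrix(y_true, y_hat)
# Rows are actual classes, columns are predicted classes;
# the diagonal counts correct predictions.
correct = cm.trace()
print(cm, correct)
```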
Plotting the graph:
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
aranged_ages = np.arange(start = X_set[:, 0].min(), stop = X_set[:, 0].max(), step = 0.01)
aranged_salaries = np.arange(start = X_set[:, 1].min(), stop = X_set[:, 1].max(), step = 0.01)
X1, X2 = np.meshgrid(aranged_ages, aranged_salaries)
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.5, cmap = ListedColormap(('orange', 'blue')))
I know that is a lot of code for plotting a graph, but let me explain what is being done here.
- The aranged_ages variable holds the scaled ages of users, from the minimum age to the maximum age in increments of 0.01.
- The aranged_salaries variable holds the scaled salaries of users, from the minimum salary to the maximum salary in increments of 0.01.
- np.meshgrid() takes aranged_ages and aranged_salaries and forms the coordinate grids X1 and X2.
- X1 and X2 are used to colour every point of the plane according to the class the logistic regression classifier predicts for it. This is done using the plt.contourf() method.
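If the meshgrid step above feels opaque, here is a tiny sketch with made-up values showing the shapes involved and how the two grids are flattened back into one (age, salary) row per grid point, which is exactly the shape classifier.predict() expects:

```python
# Illustrative only: tiny made-up ranges instead of the real scaled data.
import numpy as np

xs = np.array([1.0, 2.0, 3.0])   # e.g. three scaled ages
ys = np.array([10.0, 20.0])      # e.g. two scaled salaries
X1, X2 = np.meshgrid(xs, ys)
# X1 and X2 both have shape (len(ys), len(xs)); pairing X1[i, j]
# with X2[i, j] enumerates every (x, y) combination on the grid.
grid = np.array([X1.ravel(), X2.ravel()]).T   # one row per grid point
print(X1.shape, grid.shape)
```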
After this, the graph is plotted as below.
- Note the orange and blue sections in graph.
- Logistic Regression classification has divided all data points into two classes: users who will not buy the car and users who will.
- Orange section denotes all users who will not buy the car
- Blue section denotes all users who will buy the car.
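Under the hood, the boundary between the two sections comes from logistic regression mapping a linear score z = w·x + b through the sigmoid function to a probability, then applying a 0.5 threshold. A minimal sketch, with made-up weights in place of the classifier's learned parameters:

```python
# Illustrative only: w, b and x are made-up values, not the fitted model.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([1.5, 0.8]), -0.2   # hypothetical learned parameters
x = np.array([0.6, 0.4])            # one scaled (age, salary) point
p = sigmoid(w @ x + b)              # probability of class 1 ("buys")
prediction = int(p >= 0.5)
print(p, prediction)
```

Points where p is exactly 0.5 form the straight line separating the orange and blue regions, which is why the decision boundary of logistic regression is linear.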
Now, these are the predictions of the logistic regression classifier. Let us plot the actual observations on the same graph and compare the results.
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
These lines plot the actual data points from the test set.
- Red points denote users who did not buy the car
- Green points denote users who bought the car.
Note that we have plotted 100 observations from our test set, and out of them:
- Only 8 green points are observed on orange area
- Only 3 red points are observed in blue area
This means that out of 100 observation points, logistic regression classification predicted 89 results correctly and only 11 incorrectly.
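Rather than counting misclassified points off the plot, the same tally can be computed with sklearn's accuracy_score. The arrays below are made-up stand-ins for y_test and y_pred:

```python
# Illustrative only: y_true and y_hat are made-up stand-ins.
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0, 1])
y_hat  = np.array([0, 1, 0, 0, 1])
acc = accuracy_score(y_true, y_hat)   # fraction of correct predictions
print(acc)  # 4 of 5 correct -> 0.8
```

On our test set this would report 89 correct out of 100, i.e. an accuracy of 0.89.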
I hope this helped. Happy learning 🙂