# Logistic Regression Classification


Hello all, welcome to another tutorial in machine learning. So far we have learned about regression models, which have independent variables and a dependent variable, where the dependent variable is a result of the independent variables. For example, we predicted salary based on experience, qualifications, and so on. However, we may come across observations where the dependent variable falls into specific categories. For such machine learning models, we need a classification mechanism.

From this article onward, we will learn about classification mechanisms and the Python libraries that support them. This article focuses on logistic regression classification.

## Logistic Regression Classification:

Logistic regression classification predicts a category from an observation's features. We will use the logistic regression dataset taken from www.superdatascience.com/machine-learning
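Before diving in, it helps to recall the function at the heart of logistic regression: the sigmoid, which squashes any real-valued score into a probability between 0 and 1. A minimal sketch (the function name and sample scores are just for illustration):

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued score z to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # a score of 0 corresponds to probability 0.5
print(sigmoid(4))    # large positive scores approach 1
print(sigmoid(-4))   # large negative scores approach 0
```

The classifier turns this probability into a class by thresholding, typically predicting 1 when the probability exceeds 0.5.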

## Problem Statement:

Consider a dataset of users of a social networking site with **user id, gender, age, and salary**. We are a car manufacturing company and have launched a great new car. We want to know whether these users will buy our car or not.

## Dataset:

We have some observations, shown below. Each row has the user id, gender, age, and salary of a person, and whether that person purchased the car (1) or not (0). The full dataset is available here.

| user id | gender | age | salary | purchased |
|---------|--------|-----|--------|-----------|
| 15628523 | Male | 35 | 39000 | 0 |
| 15708196 | Male | 49 | 74000 | 0 |
| 15735549 | Female | 39 | 134000 | 1 |
| 15809347 | Female | 41 | 71000 | 0 |
| 15660866 | Female | 58 | 101000 | 1 |

## Implementation in Python:

### Data Preprocessing:

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```

- We read the “Social_Network_Ads.csv” file and stored it in dataset.
- Extracted the age and salary columns from dataset and stored them in X.
- Extracted the purchase column from dataset and stored it in y.
- Split the dataset into training and test sets so the model can be trained on X_train and y_train, and y_test can later be compared with y_pred.
- Applied feature scaling, fitting the scaler on X_train and reusing it to transform X_test.
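To see what feature scaling does, here is a small standalone sketch on toy [age, salary] rows (values invented for illustration): after fit_transform, each column has mean ≈ 0 and standard deviation ≈ 1, so age and salary contribute on comparable scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: [age, salary] rows (invented for illustration)
X_toy = np.array([[35, 39000],
                  [49, 74000],
                  [39, 134000],
                  [41, 71000]], dtype=float)

sc = StandardScaler()
X_scaled = sc.fit_transform(X_toy)

print(X_scaled.mean(axis=0))  # each column's mean is (close to) 0
print(X_scaled.std(axis=0))   # each column's std is (close to) 1
```

Without scaling, the salary column would dwarf the age column and dominate the distance-like quantities inside the optimizer.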

### Logistic Regression:

```python
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)
```

- We imported the LogisticRegression class from sklearn.linear_model.
- Created classifier as an object of **LogisticRegression**.
- Fitted the training data to the classifier.
- Predicted results for **X_test** and stored them in **y_pred**.

We have our predictions ready.
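Once y_pred is ready, a confusion matrix is a quick way to count correct and incorrect predictions before plotting anything. A minimal sketch with made-up labels standing in for y_test and y_pred:

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, for illustration only
y_true_demo = [0, 0, 1, 1, 0, 1]
y_pred_demo = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
# Row i = actual class i, column j = predicted class j:
# cm[0][0] = true negatives, cm[1][1] = true positives

accuracy = (cm[0][0] + cm[1][1]) / cm.sum()
print(accuracy)  # fraction of correct predictions
```

Running the same two lines on the real y_test and y_pred gives the correct/incorrect counts we will read off the graph below.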

### Plotting the graph:

```python
from matplotlib.colors import ListedColormap

X_set, y_set = X_test, y_test
aranged_ages = np.arange(start=X_set[:, 0].min(), stop=X_set[:, 0].max(), step=0.01)
aranged_salaries = np.arange(start=X_set[:, 1].min(), stop=X_set[:, 1].max(), step=0.01)
X1, X2 = np.meshgrid(aranged_ages, aranged_salaries)
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.5, cmap=ListedColormap(('orange', 'blue')))
```

I know that is a lot of code for plotting a graph, so let me explain what is being done here.

- **aranged_ages** will hold the scaled ages of users, from the minimum age to the maximum age in steps of 0.01.
- **aranged_salaries** will hold the scaled salaries of users, from the minimum salary to the maximum salary in steps of 0.01.
- np.meshgrid() takes aranged_ages and aranged_salaries and forms the grids X1 and X2.
- X1 and X2 are used to draw a graph that classifies every point on the grid using the logistic regression classifier. This is done with the **plt.contourf()** method.
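The meshgrid/ravel gymnastics above can be seen on a tiny example: meshgrid builds every (age, salary) combination on a grid, and ravel plus .T flattens those grids into the two-column shape the classifier expects. A small sketch with three ages and two salaries (values invented for illustration):

```python
import numpy as np

ages = np.array([1.0, 2.0, 3.0])
salaries = np.array([10.0, 20.0])

X1, X2 = np.meshgrid(ages, salaries)
print(X1.shape)  # (2, 3): one row per salary value, one column per age value

# Flatten both grids and stack them as columns -> every (age, salary) pair
grid_points = np.array([X1.ravel(), X2.ravel()]).T
print(grid_points.shape)  # (6, 2): six pairs, two features each
print(grid_points[0])     # [ 1. 10.]: the first (age, salary) pair
```

classifier.predict() is then called on all grid points at once, and reshaping the predictions back to X1.shape lets contourf colour each region by predicted class.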

After this, the graph is plotted as below.

- Note the orange and blue sections in the graph.
- Logistic regression classification has classified all points into two classes: those who will not buy the car and those who will.
- The orange section denotes all users predicted not to buy the car.
- The blue section denotes all users predicted to buy the car.

Now, these are the predictions of the logistic regression classifier. Let us plot the actual observations on the same graph and compare results.

```python
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.legend()
plt.show()
```

These lines will plot actual data points.

- Red points denote users who did not buy the car
- Green points denote users who bought the car.


Note that we have plotted 100 observations from our test set, and of those:

- Only 8 green points fall in the orange area.
- Only 3 red points fall in the blue area.

This means that out of 100 observation points, logistic regression classification predicted 89 results correctly and only 11 incorrectly.

I hope this helped. Happy learning 🙂

