Hello all, welcome to another machine learning tutorial. In this one, we will learn about the K Nearest Neighbors classification model. We will use the same example as in the logistic regression article.
K Nearest Neighbors Classification:
Let us understand what K Nearest Neighbors classification does.
- Choose the number of neighbors K for the new data point.
- Take the K nearest neighbors of the new data point according to a distance metric.
- Count how many of those neighbors belong to each category.
- Place the data point in the category with the highest number of nearest neighbors.
Consider our example, where we want to know whether a new user will buy the car or not.
- Choose the number of neighbors for this user, e.g. 5.
- Check which data points are nearest to this user, giving us 5 nearest neighbors.
- Among them, 3 users have bought the car and 2 users have not.
- Classify this user as a buyer, because 3 out of 5 neighbors have already bought it (see the sketch below).
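To make these steps concrete, here is a minimal from-scratch sketch in NumPy; the tiny dataset and the query point are made up purely for illustration (we use scikit-learn for the real model below).

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Steps 1-2: compute Euclidean distances and take the k nearest neighbors
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # Steps 3-4: count the neighbors' labels and pick the majority category
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Made-up [age, salary] pairs and whether each user bought the car (1 = yes, 0 = no)
X_demo = np.array([[25, 30000], [47, 25000], [52, 110000], [46, 22000], [56, 150000], [35, 65000]])
y_demo = np.array([0, 0, 1, 0, 1, 1])
print(knn_predict(X_demo, y_demo, np.array([50, 100000]), k=5))   # prints 1: 3 of the 5 neighbors bought the car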
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
- We read the “Social_Network_Ads.csv” file and stored it in dataset.
- Extracted the age and salary columns from dataset and stored them in X.
- Extracted the purchase information from dataset and stored it in y.
- Split the dataset into a training set and a test set, so that the machine can be trained using X_train and y_train, and y_test can later be compared with y_pred.
- Applied feature scaling to X_train and X_test (a quick check of the scaled values is shown below).
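Because K Nearest Neighbors relies on distances, features on very different scales (age vs. salary) must be scaled; otherwise salary would dominate the distance. As a quick sanity check of the scaling step above, the scaled training columns should have roughly zero mean and unit standard deviation:

# After StandardScaler, each column of X_train has mean ~0 and std ~1
print(X_train.mean(axis = 0))   # approximately [0, 0]
print(X_train.std(axis = 0))    # approximately [1, 1]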
K Nearest Neighbors Classification:
# Fitting K-NN to the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier()
classifier.fit(X_train, y_train)
- Imported KNeighborsClassifier from sklearn.neighbors.
- Created a classifier with all default parameters.
- KNeighborsClassifier takes 5 nearest neighbors by default.
- It uses the Minkowski metric with p = 2 by default, which is equivalent to Euclidean distance.
- Fitted the training set to the classifier (see the sketch below for the defaults written out explicitly).
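For reference, here is a small sketch showing the same classifier with its default parameters written out explicitly, plus a prediction for a single new user; the variable name knn_explicit and the age/salary values are made up for illustration, and the new point must be scaled with the same StandardScaler before predicting.

# Equivalent to KNeighborsClassifier() with the defaults written out explicitly
knn_explicit = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn_explicit.fit(X_train, y_train)

# Predict for a single hypothetical new user: age 30, salary 87,000 (made-up values)
new_user = sc.transform([[30, 87000]])
print(knn_explicit.predict(new_user))   # 1 = will buy, 0 = will not buy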
Plotting the Graph:
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
# Build a fine grid over the scaled age and salary ranges
aranged_ages = np.arange(start = X_set[:, 0].min(), stop = X_set[:, 0].max(), step = 0.01)
aranged_salaries = np.arange(start = X_set[:, 1].min(), stop = X_set[:, 1].max(), step = 0.01)
X1, X2 = np.meshgrid(aranged_ages, aranged_salaries)
# Classify every grid point and colour the two decision regions
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.5, cmap = ListedColormap(('orange', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
- The aranged_ages variable holds the scaled ages of users, from the minimum age to the maximum age in steps of 0.01.
- The aranged_salaries variable holds the scaled salaries of users, from the minimum salary to the maximum salary in steps of 0.01.
- np.meshgrid() takes aranged_ages and aranged_salaries and forms the grids X1 and X2.
- X1 and X2 are used to draw a graph in which every grid point is classified using K Nearest Neighbors classification. This is done with the plt.contourf() method.
- Note the orange and blue regions in the graph.
- K Nearest Neighbors classification has created two regions whose boundary is clearly non-linear.
- So unlike logistic regression, this is a non-linear classifier.
- The orange region denotes users who are predicted not to buy the car.
- The blue region denotes users who are predicted to buy the car.
Plotting the Test set:
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('K Nearest Neighbors (Test set)')
plt.xlabel('Age (scaled)')
plt.ylabel('Estimated Salary (scaled)')
plt.legend()
plt.show()
The above code plots the actual test set data points on top of the decision regions.
- Red points denote users who did not buy the car.
- Green points denote users who bought the car.
Note that we have plotted 100 observations from our test set, and out of them:
- Only 3 green points fall in the orange region.
- Only 4 red points fall in the blue region.
This means that, out of 100 test observations, K Nearest Neighbors classification predicted 93 results correctly and only 7 incorrectly.
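We can confirm this count with a confusion matrix on the test set; a minimal sketch (the exact numbers depend on the dataset and the random_state used for the split):

# Confusion matrix: off-diagonal entries are the misclassified points
from sklearn.metrics import confusion_matrix
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print((y_test == y_pred).sum(), 'out of', len(y_test), 'predicted correctly')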
I hope this helped. Happy learning 🙂