# K Nearest Neighbors classification


Hello all, welcome to another machine learning tutorial. Here, we will learn about the K nearest neighbors classification model, using the same example as in the logistic regression article.

# K Nearest Neighbors Classification:

Let us understand what K Nearest Neighbors classification does.

- Choose a number K, i.e. how many neighbors to consider for a new data point.
- Find the K nearest neighbors of the new data point according to a distance metric.
- Count how many of those neighbors belong to each category.
- Assign the new data point to the category with the most nearest neighbors.

Consider our example where we want to know whether a new user will buy the car or not.

- Choose the number of neighbors for this user, e.g. 5.
- Find the data points nearest to this user, so we have 5 nearest neighbors.
- Suppose 3 of those users have bought the car and 2 have not.
- Classify this user as a buyer, because 3 out of 5 neighbors have already bought it.
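The majority-vote logic described above can be sketched in plain Python with NumPy. The `knn_predict` helper and the toy data points below are made up for illustration; label 1 means the user bought the car, 0 means they did not:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy (age, salary) data: first five rows are near the query point
X_train = np.array([[25, 30], [26, 32], [27, 31],   # nearby users who bought
                    [24, 29], [28, 33],             # nearby users who did not
                    [60, 90], [62, 95]])            # far-away users
y_train = np.array([1, 1, 1, 0, 0, 0, 0])

print(knn_predict(X_train, y_train, np.array([26, 31])))  # → 1 (3 of 5 neighbors bought)
```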

## Data Preprocessing:

```python
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```

- We read the "Social_Network_Ads.csv" file and stored it in dataset.
- Extracted age and salary information from the dataset and stored them in X.
- Extracted purchase information from the dataset and stored it in y.
- Split the dataset into training and test sets, so that the machine can be trained using X_train and y_train, and y_test can later be compared with the predictions (y_pred).
- Applied feature scaling to X_train and X_test.
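A side note on the scaling step: we call fit_transform on the training set but only transform on the test set, because the scaler must learn its mean and standard deviation from training data alone. A minimal sketch with made-up (age, salary) rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (age, salary) rows purely for illustration
X_train = np.array([[20.0, 20000.0], [30.0, 50000.0], [40.0, 80000.0]])
X_test = np.array([[30.0, 50000.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean and std from training data
X_test_scaled = sc.transform(X_test)        # reuses the training statistics

print(X_train_scaled.mean(axis=0))  # → [0. 0.] (each column centered)
print(X_test_scaled)                # → [[0. 0.]] (this row equals the training mean)
```

Without this, the two features would be on wildly different scales and salary would dominate every distance computation.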

### K Nearest Neighbors Classification:

```python
# Fitting K-NN to the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier()
classifier.fit(X_train, y_train)
```

- Imported KNeighborsClassifier from sklearn.neighbors.
- Created a classifier with all default parameters:
- KNeighborsClassifier takes 5 nearest neighbors by default (n_neighbors=5).
- It uses Euclidean distance by default (metric='minkowski' with p=2).

- Fit the classifier to the training set.
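Since y_test will later be compared against predictions, here is a sketch of how y_pred and a confusion matrix could be obtained from a fitted classifier. The well-separated toy data below is invented and stands in for the scaled age/salary features:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# Invented toy data standing in for scaled (age, salary) features
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.5, 0.5], [5.5, 5.5]])
y_test = np.array([0, 1])

classifier = KNeighborsClassifier()  # n_neighbors=5, Euclidean distance by default
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)    # → array([0, 1])
cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class
print(cm)
```

The diagonal of the confusion matrix counts correct predictions; everything off the diagonal is a misclassification.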

### Plotting the Graph:

```python
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
aranged_ages = np.arange(start = X_set[:, 0].min(), stop = X_set[:, 0].max(), step = 0.01)
aranged_salaries = np.arange(start = X_set[:, 1].min(), stop = X_set[:, 1].max(), step = 0.01)
X1, X2 = np.meshgrid(aranged_ages, aranged_salaries)
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.5, cmap = ListedColormap(('orange', 'blue')))
```

- The **aranged_ages** variable will hold scaled ages of users, starting from the minimum age to the maximum age, incremented by 0.01.
- The **aranged_salaries** variable will hold scaled salaries of users, starting from the minimum salary to the maximum salary, incremented by 0.01.
- np.meshgrid() takes aranged_ages and aranged_salaries to form X1 and X2.
- X1 and X2 are used to create a graph which classifies every point in the plane using K nearest neighbors classification. This is done using the **plt.contourf()** method.

- Note the orange and blue sections in graph.
- K Nearest Neighbors classification has created two regions whose boundary is clearly non-linear.
- So unlike logistic regression, K nearest neighbors is a non-linear classifier.
- Orange section denotes all users who will not buy the car
- Blue section denotes all users who will buy the car.

### Plotting Test set:

```python
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('K Nearest Neighbors (Test set)')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.legend()
plt.show()
```

The above code plots the actual test set observations on top of the classification regions.

- Red points denote users who did not buy the car
- Green points denote users who bought the car.

Note that we have plotted 100 observations from our test set, and out of them:

- Only 3 green points are observed in the orange area.
- Only 4 red points are observed in the blue area.

This means that, out of 100 observation points, K nearest neighbors classification predicted 93 results correctly and only 7 incorrectly.
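That 93-out-of-100 figure corresponds to an accuracy of 0.93, which can be computed with scikit-learn's accuracy_score. The labels below are synthetic, constructed only to reproduce 7 mismatches out of 100:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Synthetic labels built only to reproduce 7 mismatches out of 100
y_test = np.zeros(100, dtype=int)
y_pred = y_test.copy()
y_pred[:7] = 1  # 7 incorrect predictions

acc = accuracy_score(y_test, y_pred)
print(acc)  # → 0.93
```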

I hope this helped. Happy learning 🙂