Hello all, welcome to another machine learning tutorial. Here we will learn about the Random Forest classification model. This article is quite similar to the previous classification articles: we simply swap in a new scikit-learn classifier, while the data preprocessing and plotting stay the same.
Random Forest Classification:
Random Forest classification works on the same concept as Random Forest Regression.
Consider our example from logistic regression, where we want to know whether a new user will buy the car.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in very old scikit-learn versions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
- We read the “Social_Network_Ads.csv” file and stored it in dataset.
- Extracted the age and salary columns from the dataset and stored them in X.
- Extracted the purchase information from the dataset and stored it in y.
- Split the dataset into training and test sets so the machine can be trained using X_train and y_train.
- Applied feature scaling to X_train and X_test.
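Feature scaling with StandardScaler standardizes each column to zero mean and unit variance, so that age and salary contribute on comparable scales. A quick sketch with made-up ages and salaries (these numbers are for illustration only, not from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ages and salaries, standing in for a few rows of X_train
X_demo = np.array([[25.0, 30000.0],
                   [35.0, 50000.0],
                   [45.0, 70000.0]])

sc = StandardScaler()
X_scaled = sc.fit_transform(X_demo)

print(X_scaled.mean(axis=0))  # each column now has mean ~0
print(X_scaled.std(axis=0))   # and standard deviation ~1
```

Note that in the tutorial code the scaler is fitted on X_train only and then reused to transform X_test, so the test set is scaled with the training set's statistics.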
Random Forest Classification:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
- Imported RandomForestClassifier from sklearn.ensemble
- Created a classifier and
- provided the number of estimators as 10, i.e. our forest will be composed of 10 decision trees
- applied the widely used ‘entropy’ criterion to it
- fitted the classifier to the training set
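The ‘entropy’ criterion scores how impure a node is using Shannon entropy: a node whose samples all belong to one class scores 0, while a 50/50 split of two classes scores the maximum of 1 bit. A minimal sketch of the computation, with hypothetical class counts:

```python
import numpy as np

def shannon_entropy(class_counts):
    """Shannon entropy (in bits) of a node, given sample counts per class."""
    probs = np.array(class_counts, dtype=float)
    probs = probs / probs.sum()
    probs = probs[probs > 0]  # treat 0 * log(0) as 0
    return -np.sum(probs * np.log2(probs))

print(shannon_entropy([10, 0]))  # 0.0 - a pure node
print(shannon_entropy([5, 5]))   # 1.0 - maximally mixed node
```

Each decision tree in the forest picks splits that reduce this impurity as much as possible.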
Plotting the Graph:
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
aranged_ages = np.arange(start = X_set[:, 0].min(), stop = X_set[:, 0].max(), step = 0.01)
aranged_salaries = np.arange(start = X_set[:, 1].min(), stop = X_set[:, 1].max(), step = 0.01)
X1, X2 = np.meshgrid(aranged_ages, aranged_salaries)
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.5, cmap = ListedColormap(('orange', 'blue')))
- The aranged_ages variable holds the scaled ages of users, from the minimum to the maximum age, in steps of 0.01.
- The aranged_salaries variable holds the scaled salaries of users, from the minimum to the maximum salary, in steps of 0.01.
- np.meshgrid() combines aranged_ages and aranged_salaries into the grids X1 and X2.
- X1 and X2 define a grid of points, each of which is classified by the random forest; the colored regions are then drawn with the plt.contourf() method.
- Random forest classification takes the number of decision trees as an input parameter.
- It collects the predictions of all the decision trees and takes a majority vote among them.
- The majority prediction is chosen as the final result, and these results are plotted on the graph as above.
- The orange region is for users predicted not to buy the car, and the blue region is for users predicted to buy it.
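The majority vote described above can be sketched with toy per-tree predictions (the votes below are made up for illustration; 1 = buys, 0 = doesn’t buy):

```python
import numpy as np

# Hypothetical predictions from our 10 decision trees for a single user
tree_votes = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# Count votes per class; the forest predicts the most common one
counts = np.bincount(tree_votes)          # [3, 7] -> 3 votes for 0, 7 for 1
forest_prediction = np.argmax(counts)

print(forest_prediction)  # 1: the majority of trees predict "buys"
```

This is exactly what `classifier.predict()` does for every grid point before `plt.contourf()` colors the regions.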
Plotting Test set:
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Random Forest Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
The above code plots the actual test-set data points on top of the classification regions.
- Red points denote users who did not buy the car
- Green points denote users who bought the car.
Note that we have plotted 100 observations from our test set, and out of them:
- 3 green points are observed in the orange area
- 5 red points are observed in the blue area
This means that, out of 100 observations, random forest classification predicted 92 results correctly and only 8 incorrectly.
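This count can also be checked programmatically with a confusion matrix. The sketch below uses hypothetical labels (68 non-buyers, 32 buyers) arranged to reproduce the 3 + 5 misclassifications described above; with the real data you would pass `y_test` and `classifier.predict(X_test)` instead:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical stand-ins for y_test and classifier.predict(X_test)
y_true = np.array([0] * 68 + [1] * 32)
y_pred = np.concatenate([
    np.array([0] * 63 + [1] * 5),   # 5 actual non-buyers predicted as buyers
    np.array([1] * 29 + [0] * 3),   # 3 actual buyers predicted as non-buyers
])

cm = confusion_matrix(y_true, y_pred)
print(cm)
print(accuracy_score(y_true, y_pred))  # 0.92
```

The off-diagonal entries of the matrix are the 5 and 3 misclassified points, and the diagonal sums to the 92 correct predictions.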
I hope this helped. Happy learning 🙂