# Backward Elimination for multiple linear regression

We learned about multiple linear regression and predicted values of a dependent variable based on multiple independent variables. But how can we identify the impact a specific independent variable makes on the dependent variable? We can use backward elimination for multiple linear regression to identify the independent variables that have the most impact on the dependent variable.

## Backward Elimination For Multiple Regression

Before we dive deeper, let us understand why we are doing this. Consider our example dataset:

| Department | WorkedHours | Certification | YearsExperience | Salary |
|------------|-------------|---------------|-----------------|--------|
| Development | 2300 | 0 | 1.1 | 39343 |
| Testing | 2100 | 1 | 1.3 | 46205 |
| Development | 2104 | 2 | 1.5 | 37731 |
| UX | 1200 | 1 | 2 | 43525 |
| Testing | 1254 | 2 | 2.2 | 39891 |
| UX | 1236 | 1 | 2.9 | 56642 |
| Development | 1452 | 2 | 3 | 60150 |
| Testing | 1789 | 1 | 3.2 | 54445 |
| UX | 1645 | 1 | 3.2 | 64445 |
| UX | 1258 | 0 | 3.7 | 57189 |
| Testing | 1478 | 3 | 3.9 | 63218 |
| Development | 1257 | 2 | 4 | 55794 |
| Development | 1596 | 1 | 4 | 56957 |
| Testing | 1256 | 2 | 4.1 | 57081 |
| UX | 1489 | 3 | 4.5 | 61111 |
| Development | 1236 | 3 | 4.9 | 67938 |
| Testing | 2311 | 2 | 5.1 | 66029 |
| UX | 2245 | 3 | 5.3 | 83088 |
| Development | 2365 | 1 | 5.9 | 81363 |
| Development | 1500 | 3 | 6 | 93940 |
| Testing | 1456 | 2 | 6.8 | 91738 |
| Testing | 1760 | 1 | 7.1 | 98273 |
| UX | 2400 | 4 | 7.9 | 101302 |
| Development | 2148 | 3 | 8.2 | 113812 |
| UX | 1450 | 2 | 8.7 | 109431 |
| UX | 1000 | 4 | 9 | 105582 |
| Testing | 1540 | 3 | 9.5 | 116969 |
| Development | 1500 | 2 | 9.6 | 112635 |
| Testing | 3000 | 4 | 10.3 | 122391 |
| UX | 2100 | 3 | 10.5 | 121872 |

Note that we will split the above dataset into train and test data to verify the predictions. We have already performed regressions with reference to the above data:

- Simple Linear Regression, where salary is predicted only from years of experience.
- Multiple Linear Regression, where salary is predicted from years of experience, worked hours, certifications earned and job role.

We have the following results for them.

If we compare the salaries predicted by simple linear regression and multiple linear regression, we can note that the simple linear regression results are much closer to the actual salaries. Now, what could be the reason? We may have some columns which are not really significant for predicting salaries. For example, the number of certifications earned may not have much impact on salary. We want to make our model better by eliminating the variables which do not have much impact on salary. This is where we need backward elimination for multiple linear regression.

## Backward Elimination:

- Select a significance level (for example, 0.05).
- Fit the model with all possible independent variables.
- Consider the variable with the highest p-value.
- If its p-value is greater than the significance level, remove that variable.
- Fit the model again without the removed variable, and repeat the last three steps until no remaining variable has a p-value above the significance level.

Here, significance level and p-value are statistical terms. Just remember them for now, as we do not want to go into the details. Note that our Python libraries will provide these values for our independent variables.
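To make the stopping rule concrete, here is a small illustration; the p-values below are hypothetical stand-ins for what a fitted model would report:

```python
SL = 0.05  # chosen significance level

# Hypothetical p-values reported by a fitted model, one per variable
p_values = [0.000, 0.973, 0.739, 0.046]

worst = max(p_values)
if worst > SL:
    print("Remove the variable with p-value", worst, "and refit the model.")
else:
    print("All remaining variables are significant; stop eliminating.")
```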

Now coming to our scenario, we want our salary predictions to be more accurate and we have to decide which independent variables to consider for making a final model.

## Backward Elimination For Multiple Linear Regression:

Please note that we are taking this program from the previous article, as we need the values of the predicted salaries, **y_pred**.

```python
# Multiple Linear Regression

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Employee_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

# Encoding categorical data (Department column)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 0] = labelencoder.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()

# Avoiding the Dummy Variable Trap
X = X[:, 1:]

# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the Test set results
y_pred = regressor.predict(X_test)
```
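A caveat for readers on newer scikit-learn versions: the `sklearn.cross_validation` module and the `categorical_features` argument of `OneHotEncoder` have since been removed (see also the comments below). A rough sketch of the same preprocessing with the current `model_selection` and `ColumnTransformer` APIs might look like this:

```python
# Same preprocessing rewritten for newer scikit-learn versions
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

dataset = pd.read_csv('Employee_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

# One-hot encode the Department column (index 0) and pass the rest through unchanged
ct = ColumnTransformer([('encoder', OneHotEncoder(), [0])],
                       remainder = 'passthrough', sparse_threshold = 0)
X = ct.fit_transform(X)

# Avoiding the Dummy Variable Trap, as before
X = X[:, 1:]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
```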

At this point, y_pred contains the predicted salaries for the X_test matrix; these values were compared with the actual salaries above.

Now, as we know in multiple linear regression,

y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + … + b_{n}X_{n}

we can also represent it as

y = b_{0}X_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + … + b_{n}X_{n}, where X_{0} = 1

So we can add one column with all values equal to 1 to represent b_{0}X_{0}.

```python
import statsmodels.api as sm  # in older statsmodels versions this was statsmodels.formula.api
X = np.append(arr = np.ones([30, 1]).astype(int), values = X, axis = 1)  # 30 = number of rows in our dataset
```

This is done because the statsmodels OLS implementation does not add an intercept term automatically; we have to supply the column of ones for the constant ourselves.
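As an aside, statsmodels also ships a helper that does the same thing; a one-line alternative sketch using `sm.add_constant`:

```python
# Equivalent to the np.append call above: prepends a column of ones to X
X = sm.add_constant(X)
```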

Now, following the backward elimination algorithm for multiple linear regression, let us first fit the model with all the variables:

```python
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
```

- Run the above lines of code. We created another matrix, X_opt, with all independent variables.
- We created a regressor with the statsmodels library and fitted it on y and X_opt.
- Now we look at the summary of our regressor and find the p-values in it.
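Since the summary screenshots are not reproduced here, note that the fitted model also exposes these values programmatically; a small sketch for inspecting them:

```python
# p-values of the fitted coefficients, in the same order as the columns of X_opt
print(regressor_OLS.pvalues)

# Index of the variable with the highest p-value, i.e. the candidate for removal
print(regressor_OLS.pvalues.argmax())
```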

Now we have to remove the 2nd column, i.e. x2, because its p-value is the highest: 0.973.

Execute the line below:

```python
X_opt = X[:, [0, 1, 3, 4, 5]]
```

Now X_opt contains the constant column, one dummy variable column, WorkedHours, Certification and YearsExperience.

Let us fit the model again after removing the 2nd column, which was a dummy variable column:

```python
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
```

Now check the p-values again. The value for x3 is the highest, 0.739, hence we will remove it. This means the x3 column, which was the number of certifications earned, did not have a significant impact on salary prediction.

Execute the line below to remove x3:

```python
X_opt = X[:, [0, 1, 3, 5]]
```

After removing x3, our X_opt contains the constant column, one dummy variable column, WorkedHours and YearsExperience.

Now fit the model again:

```python
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
```

Looking at the summary again, we find that the p-value of x1, i.e. the other dummy variable from our perspective, is 0.516, so it needs to be removed next. Our X_opt should become:

```python
X_opt = X[:, [0, 3, 5]]
```

Execute the above line; X_opt now contains only the constant column, WorkedHours and YearsExperience. Again execute the lines below:

```python
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
```

This gives the p-values once more. Now remove the x1 column, which represents worked hours: its p-value comes out as 0.176, which is well over our significance level of 0.05, hence we remove the worked hours column too.

```python
X_opt = X[:, [0, 5]]
```

Execute the above line; X_opt now contains only the constant column and YearsExperience.

Note that we now have only two columns left, and if you look closely at the p-values for const and x1 in the summary, both are 0.000, which is below the significance level of 0.05. This means they are significant and have a real impact on salary predictions, so our backward elimination for multiple linear regression stops here. We will predict salaries based on the current state of X_opt.

## Predict salaries for X_opt

```python
# Splitting the dataset into the Training set and Test set
X_opt_train, X_opt_test, y_opt_train, y_opt_test = train_test_split(X_opt, y, test_size = 1/3, random_state = 0)

# Fitting the optimized model and predicting the Test set results
regressor_opt = LinearRegression()
regressor_opt.fit(X_opt_train, y_opt_train)
y_opt_pred = regressor_opt.predict(X_opt_test)
```

After these predictions, let us compare the salaries predicted using simple linear regression, multiple linear regression on the original X, multiple linear regression on X_opt, and the real salaries from y_test.
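The comparison screenshot is not reproduced here, but the predictions can be tabulated side by side; a small sketch, assuming y_pred from the full multiple regression above is still in scope (y_opt_test equals y_test because both splits use random_state = 0):

```python
# Side-by-side comparison of actual salaries and the two sets of predictions
comparison = pd.DataFrame({
    'Actual': y_test,
    'Multiple LR (all variables)': y_pred,
    'After backward elimination': y_opt_pred
})
print(comparison)
```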

Note that the predictions obtained from backward elimination for multiple linear regression are much closer to the actual results, and in our case the same as those of simple linear regression.

I hope this article helped you understand backward elimination for multiple linear regression.

Really good read, looking forward to more.

Thank you Sachin, I took your advice about machine learning seriously 🙂

Where can I get the formula that has been used in this context to calculate the p-values of individual columns?

Hi Anwesh, thank you for your comment. You can find the code and formula for the p-value calculation in the statsmodels library for Python; it is what the given program uses.

How do I plot this y_opt_pred on a graph? I am trying

plt.scatter(X_opt_train, y_opt_train, color = "red")

plt.show()

but it shows an error (x and y should be the same size).

Can you please share the shapes of the x and y matrices?
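For readers hitting the same error: after elimination, X_opt_train has two columns (the constant and YearsExperience) while y_opt_train is one-dimensional, so plt.scatter cannot pair them. A sketch that plots only the YearsExperience column (index 1 in the X_opt built above):

```python
import matplotlib.pyplot as plt

# Column 0 of X_opt is the constant; column 1 is YearsExperience
plt.scatter(X_opt_train[:, 1], y_opt_train, color = 'red')   # training data
plt.scatter(X_opt_test[:, 1], y_opt_pred, color = 'blue')    # predicted test salaries
plt.xlabel('YearsExperience')
plt.ylabel('Salary')
plt.show()
```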

Good Article.

The "cross_validation" module is deprecated and you should use the "model_selection" module instead to split the train and test data.

```python
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
```

Hi n mohammad, thank you for your suggestion. I will make the necessary changes in the article soon. Keep providing constructive feedback!

Is there any other method to eliminate the columns automatically, for example by using a for loop, since here we are doing it manually?

Yes, we can loop through it until the threshold is achieved.
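A minimal sketch of such a loop, not from the original article: it assumes X (with the column of ones) and y as built above, and drops the worst variable per iteration using the pvalues attribute of the fitted model:

```python
import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, significance_level = 0.05):
    """Drop the least significant column until all p-values are below the threshold."""
    columns = list(range(X.shape[1]))
    while True:
        X_opt = X[:, columns]
        regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
        p_values = regressor_OLS.pvalues
        worst = int(np.argmax(p_values))
        if p_values[worst] <= significance_level:
            break  # every remaining variable is significant
        del columns[worst]  # remove the variable with the highest p-value and refit
    return X[:, columns], regressor_OLS

X_opt, final_model = backward_elimination(X, y)
print(final_model.summary())
```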

Hi Prasad,

Could you please explain why we fitted our data to an OLS model instead of the LinearRegression model we have been using, in the line below?

regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()

We should be fitting and subsequently refitting the data to the model we have been using, right?

arr = np.ones([30,1]).astype(int), values = X, axis = 1)

In the above line, how do I decide the size of the ones (e.g. [30, 1])? Does it depend on the dataset?
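The [30, 1] shape must match the number of rows in the dataset (30 here), so yes, it depends on the dataset; a shape-independent sketch reads the row count from X itself:

```python
# One entry per row of X, regardless of dataset size
X = np.append(arr = np.ones([X.shape[0], 1]).astype(int), values = X, axis = 1)
```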