# Backward Elimination for multiple linear regression


We learned about multiple linear regression and predicted values of the dependent variable based on multiple independent variables. However, how can we identify the impact a specific independent variable makes on the dependent variable? We can use backward elimination for multiple linear regression to identify the independent variables which have the most impact on the dependent variable.

# Backward Elimination For Multiple Regression

Before we dive deeper, let us understand why we are doing this. Consider our example dataset

| Department | WorkedHours | Certification | YearsExperience | Salary |
|---|---|---|---|---|
| Development | 2300 | 0 | 1.1 | 39343 |
| Testing | 2100 | 1 | 1.3 | 46205 |
| Development | 2104 | 2 | 1.5 | 37731 |
| UX | 1200 | 1 | 2 | 43525 |
| Testing | 1254 | 2 | 2.2 | 39891 |
| UX | 1236 | 1 | 2.9 | 56642 |
| Development | 1452 | 2 | 3 | 60150 |
| Testing | 1789 | 1 | 3.2 | 54445 |
| UX | 1645 | 1 | 3.2 | 64445 |
| UX | 1258 | 0 | 3.7 | 57189 |
| Testing | 1478 | 3 | 3.9 | 63218 |
| Development | 1257 | 2 | 4 | 55794 |
| Development | 1596 | 1 | 4 | 56957 |
| Testing | 1256 | 2 | 4.1 | 57081 |
| UX | 1489 | 3 | 4.5 | 61111 |
| Development | 1236 | 3 | 4.9 | 67938 |
| Testing | 2311 | 2 | 5.1 | 66029 |
| UX | 2245 | 3 | 5.3 | 83088 |
| Development | 2365 | 1 | 5.9 | 81363 |
| Development | 1500 | 3 | 6 | 93940 |
| Testing | 1456 | 2 | 6.8 | 91738 |
| Testing | 1760 | 1 | 7.1 | 98273 |
| UX | 2400 | 4 | 7.9 | 101302 |
| Development | 2148 | 3 | 8.2 | 113812 |
| UX | 1450 | 2 | 8.7 | 109431 |
| UX | 1000 | 4 | 9 | 105582 |
| Testing | 1540 | 3 | 9.5 | 116969 |
| Development | 1500 | 2 | 9.6 | 112635 |
| Testing | 3000 | 4 | 10.3 | 122391 |
| UX | 2100 | 3 | 10.5 | 121872 |

Note that we will split the above dataset into training and test data to verify predictions. We have already performed regressions with reference to the above data.

We have the following results for them:

If we compare the salaries predicted by simple linear regression and multiple linear regression, we can see that the simple linear regression results are much closer to the actual salaries. What could be the reason? We may have some columns which are not really significant for predicting salaries. For example, the number of certifications completed may not have much impact on salary. We want to improve our model by eliminating the variables which do not have much impact on salary. This is where we need backward elimination for multiple linear regression.

## Backward Elimination:

• Select a significance level (for example, 0.05).
• Fit the model with all possible independent variables.
• Consider the variable with the highest p-value.
• If its p-value is greater than the significance level, remove that variable.
• Fit the model again without the removed variable, and repeat until no remaining variable's p-value exceeds the significance level.

Here, significance level and p-value are statistical terms. Just remember them for now, as we do not want to go into the details. Note that our Python libraries will provide these values for our independent variables.

Coming back to our scenario: we want our salary predictions to be more accurate, and we have to decide which independent variables to include in the final model.

## Backward Elimination For Multiple Linear Regression:

Please note that we are continuing the program from the previous article, as we need the values of the predicted salaries y_pred.
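The full program from the previous article appeared in screenshots; below is a minimal, runnable sketch of that step. The 12-row inline sample and variable names like `data` are stand-ins for illustration, replacing the original CSV file:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# A few rows of the example dataset, inlined so the snippet runs on its own
data = pd.DataFrame({
    'Department':      ['Development', 'Testing', 'Development', 'UX',
                        'Testing', 'UX', 'Development', 'Testing',
                        'UX', 'UX', 'Testing', 'Development'],
    'WorkedHours':     [2300, 2100, 2104, 1200, 1254, 1236,
                        1452, 1789, 1645, 1258, 1478, 1257],
    'Certification':   [0, 1, 2, 1, 2, 1, 2, 1, 1, 0, 3, 2],
    'YearsExperience': [1.1, 1.3, 1.5, 2.0, 2.2, 2.9,
                        3.0, 3.2, 3.2, 3.7, 3.9, 4.0],
    'Salary':          [39343, 46205, 37731, 43525, 39891, 56642,
                        60150, 54445, 64445, 57189, 63218, 55794],
})

# One-hot encode Department, dropping one dummy to avoid the dummy variable trap
X = pd.get_dummies(data.drop(columns='Salary'), drop_first=True).values.astype(float)
y = data['Salary'].values

# Split into training and test sets, as in the previous article
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# Fit multiple linear regression and predict salaries for the test set
regressor = LinearRegression().fit(X_train, y_train)
y_pred = regressor.predict(X_test)
```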

At this point, y_pred contains the predicted salaries for the X_test matrix. The values of y_pred were already compared with the actual salaries in the screenshot above.

Now, as we know in multiple linear regression,

y = b0 + b1X1 + b2X2 + b3X3 + … + bnXn

we can also represent it as

y = b0X0 + b1X1 + b2X2 + b3X3 + … + bnXn, where X0 = 1

So we can add one column with all values equal to 1 to represent b0X0.

This is done because the statsmodels library requires an explicit column for the constant term; it does not add an intercept automatically.
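A sketch of that step, assuming X is a NumPy feature matrix (a tiny stand-in array is used here so the snippet runs on its own):

```python
import numpy as np

# Stand-in for the (n_samples, 5) feature matrix from the earlier step
X = np.array([[2300, 0, 1.1, 0, 0],
              [2100, 1, 1.3, 1, 0],
              [1200, 1, 2.0, 0, 1]], dtype=float)

# Prepend a column of ones so that b0 gets its own "X0 = 1" regressor
X = np.append(arr=np.ones((X.shape[0], 1)), values=X, axis=1)
```

statsmodels also provides `sm.add_constant`, which does the same thing in one call.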

Now, following the backward elimination algorithm, let us fit the model with all variables:

• Run the above lines of code; we created another matrix X_opt with all independent variables.
• Created a regressor with the statsmodels library and fitted y and X_opt.
• Now we look at the summary of our regressor and find the p-values in it.

Now we have to remove the 2nd column, i.e. x2, because its p-value is the highest, at 0.973.

Execute the line below:
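The exact line is in a screenshot; a hedged sketch of the removal, assuming X_opt is a NumPy array and the offending column sits at index 2:

```python
import numpy as np

# Stand-in for X_opt: ones column plus five features (shape (4, 6))
X_opt = np.arange(24, dtype=float).reshape(4, 6)
X_opt[:, 0] = 1.0

# Drop the column whose p-value was highest (index 2 here, i.e. x2)
X_opt = np.delete(X_opt, 2, axis=1)
```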

Now, X_opt becomes,

Fit our model again after removing the 2nd column, which was a dummy variable column.

Now again check p-values,

The value for x3 is the highest, at 0.739, hence we will remove it. This means that the x3 column, the number of certifications completed, did not have a significant impact on salary prediction.

Execute the line below to remove x3:

After removing x3, our X_opt becomes

Now again fit our model after removing x3,

Again, after inspecting the summary, we find the p-values below:

Now the p-value of x1, i.e. another dummy variable from our perspective, is 0.516, which means it needs to be removed. So our X_opt should be:

Execute above line and now X_opt becomes

Again execute below lines

This will give p values as below

Now remove the x1 column, which represents worked hours; its p-value is 0.176, which is well over our significance level of 0.05, hence we can remove the worked hours column too.

Execute above line and X_opt becomes as shown on the right side.

Note that we now have only two columns left. If you look closely at the p-values for const and x2 in the p-value table above, they are 0.000, which is below the significance level of 0.05. This means they are significant and have a real impact on salary predictions. So our backward elimination for multiple linear regression stops here, and we will predict salaries based on the current state of X_opt.

## Predict salaries for X_opt

After these predictions, let us compare the salaries predicted using simple linear regression, multiple linear regression on the original X, multiple linear regression on X_opt, and the actual salaries from y_test.

Note that the predictions obtained from backward elimination for multiple linear regression are much closer to the actual results, and the same as simple linear regression in our case.

I hope this article helped you understand backward elimination for multiple linear regression.


### 13 thoughts on “Backward Elimination for multiple linear regression”

• February 15, 2018 at 3:43 pm

Really good read, looking forward to more

• February 15, 2018 at 11:04 pm

Thank you Sachin, I took your word advice about machine learning seriously 🙂

• Pingback:Polynomial Regression - theJavaGeek

• May 21, 2018 at 7:43 pm

where can I get the formula that has been used in this context to calculate p values of individual columns?

• May 30, 2018 at 10:16 pm

Hi Anwesh, thank you for your comment. You can find the code and formula for the p-value calculation in the statsmodels library for Python. It is used in the given program.

• May 23, 2018 at 12:29 pm

how do I plot this y_opt_pred on a graph?
I am trying

```python
plt.scatter(X_opt_train, y_opt_train, color="red")
plt.show()
```

it shows an error (x and y should be the same size)

• May 30, 2018 at 10:14 pm

can you please share the states of x and y matrices?

• July 18, 2018 at 9:41 pm

Good Article.
The "cross_validation" module is deprecated; you should use the "model_selection" module instead to split the train and test data.

```python
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
```

• July 24, 2018 at 11:32 am

Hi n mohammad, thank you for your suggestion. I will make the necessary changes in the article soon. Keep providing constructive feedback.

• July 19, 2018 at 11:36 pm

is there any other method to eliminate the columns automatically like by using for loop, as here we are doing it manually.

• July 24, 2018 at 11:36 am

Yes, we can loop through it until the threshold is achieved.

• August 13, 2018 at 4:10 pm

Could you please explain why we fitted our data to an OLS model instead of the LinearRegression model we have been using, in the line below?

```python
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
```

We should be fitting and subsequently refitting the data to the model we have been using right ?
