Backward Elimination for multiple linear regression


We learned about multiple linear regression and predicted values of the dependent variable based on multiple independent variables. But how can we identify the impact made by a specific independent variable on the dependent variable? We can use backward elimination for multiple linear regression to identify the independent variables which have the most impact on the dependent variable.

Backward Elimination For Multiple Regression

Before we dive deeper, let us understand why we are doing this. Consider our example dataset:

Department WorkedHours Certification YearsExperience Salary
Development 2300 0 1.1 39343
Testing 2100 1 1.3 46205
Development 2104 2 1.5 37731
UX 1200 1 2 43525
Testing 1254 2 2.2 39891
UX 1236 1 2.9 56642
Development 1452 2 3 60150
Testing 1789 1 3.2 54445
UX 1645 1 3.2 64445
UX 1258 0 3.7 57189
Testing 1478 3 3.9 63218
Development 1257 2 4 55794
Development 1596 1 4 56957
Testing 1256 2 4.1 57081
UX 1489 3 4.5 61111
Development 1236 3 4.9 67938
Testing 2311 2 5.1 66029
UX 2245 3 5.3 83088
Development 2365 1 5.9 81363
Development 1500 3 6 93940
Testing 1456 2 6.8 91738
Testing 1760 1 7.1 98273
UX 2400 4 7.9 101302
Development 2148 3 8.2 113812
UX 1450 2 8.7 109431
UX 1000 4 9 105582
Testing 1540 3 9.5 116969
Development 1500 2 9.6 112635
Testing 3000 4 10.3 122391
UX 2100 3 10.5 121872

 

Note that we will split the above dataset into training and test data to verify predictions. We have already performed simple and multiple linear regression on this data.

We have the following results for them:

Simple and Multiple Linear Regression Comparison

If we compare the salaries predicted by simple linear regression and multiple linear regression, we can see that the simple linear regression results are much closer to the actual salaries. What could be the reason? We may have some columns which are not really significant for predicting salaries. For example, the number of certifications done may not have much impact on salary. We want to make our model better by eliminating the variables which do not have much impact on salary. This is where we need backward elimination for multiple linear regression.

Backward Elimination:

  • Select a significance level (for example, 0.05).
  • Fit the model with all possible independent variables.
  • Consider the variable with the highest p-value.
  • If its p-value is greater than the significance level, remove that variable.
  • Fit the model again without the removed variable, and repeat until no remaining variable has a p-value above the significance level.

Here, significance level and p-value are statistical terms. Just remember them for now, as we do not want to go into details. Note that our Python libraries will provide these values for our independent variables.

Now, coming to our scenario: we want our salary predictions to be more accurate, and we have to decide which independent variables to keep for the final model.

Backward Elimination For Multiple Linear Regression:

Please note that we are taking this program from the previous article, as we need the values of the predicted salaries y_pred.
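Since that program is not reproduced here, below is a minimal sketch of it. The file name salaries.csv and the use of pandas get_dummies for the Department column are assumptions; only the overall flow (encode, split, fit, predict) follows the previous article.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load the dataset shown above (file name is a placeholder).
dataset = pd.read_csv('salaries.csv')

# Encode Department into dummy variables and drop one to avoid the dummy variable trap,
# so the feature order is: dummy1, dummy2, WorkedHours, Certification, YearsExperience.
dummies = pd.get_dummies(dataset['Department'], drop_first=True)
X = np.column_stack((dummies.values,
                     dataset[['WorkedHours', 'Certification', 'YearsExperience']].values))
y = dataset['Salary'].values

# Split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# Fit multiple linear regression and predict salaries for the test set.
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)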

 

 

At this point, y_pred contains the predicted salaries for the X_test matrix. The values of y_pred have already been compared with the actual salaries in the comparison above.

Now, as we know in multiple linear regression,

y = b0 + b1X1 + b2X2 + b3X3 + … + bnXn

we can also represent it as

y = b0X0 + b1X1 + b2X2 + b3X3 + … + bnXn, where X0 = 1

So we can add one column with all values equal to 1 to represent X0.

This is done because the statsmodels library requires the constant column to be added explicitly.
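A sketch of that step, assuming the dataset's 30 rows (np is numpy, imported in the sketch above):

# Prepend a column of ones so statsmodels can estimate the constant b0.
X = np.append(arr = np.ones([30, 1]).astype(int), values = X, axis = 1)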

Now, according to the backward elimination algorithm for multiple linear regression, let us fit the model with all the variables.
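A sketch of that fit with statsmodels; the OLS call below matches the line quoted in the comments, while the column indices assume the ordering described above.

import statsmodels.api as sm

# 0 = constant, 1-2 = department dummies, 3 = WorkedHours,
# 4 = Certification, 5 = YearsExperience
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())   # shows coefficients and p-values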

  • Run the above lines of code. We created another matrix X_opt with all independent variables.
  • We created a regressor with the statsmodels library and fitted it with y and X_opt.
  • Now we look at the summary of our regressor and read the p-values from it.
First Elimination

Now we have to remove the 2nd column, i.e. x2, because its p-value is the highest, i.e. 0.973.

Execute the line below.
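A sketch of that line, keeping all columns except index 2 (the dummy variable with the 0.973 p-value):

X_opt = X[:, [0, 1, 3, 4, 5]]   # drop the insignificant dummy variable column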

Now, X_opt becomes,

X_opt after first elimination

Fit our model again after removing the 2nd column, which was a dummy variable column.
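The refit is the same OLS call on the reduced matrix:

regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())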

Now again check p-values,

p values for second elimination

The value for x3 is the highest, i.e. 0.739, hence we will remove it. This means that the x3 column, which was the number of certifications done, did not have a significant impact on salary prediction.

Execute the line below to remove x3.
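Under the assumed column ordering, x3 corresponds to index 4 (Certification):

X_opt = X[:, [0, 1, 3, 5]]   # drop the Certification column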

After removing x3, our X_opt becomes

X_opt after second elimination

Now fit our model again after removing x3, using the same two OLS lines as before.

Again, looking at the summary, we find the p-values below.

p values for third elimination

Now the p-value of x1, i.e. the other dummy variable from our perspective, is 0.516, so it needs to be removed. Our X_opt should therefore be:
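Under the assumed ordering, the remaining dummy variable is column 1:

X_opt = X[:, [0, 3, 5]]   # drop the remaining dummy variable column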

Execute the above line, and now X_opt becomes

X_opt after third elimination

Again fit the model with the same two OLS lines as before.

This will give the p-values below.

p values for fourth elimination

Now remove the x1 column, which represents worked hours. Its p-value is 0.176, which is way over our significance level of 0.05, hence we can remove the worked hours column too.
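Under the assumed ordering, worked hours is column 3, so only the constant and YearsExperience remain:

X_opt = X[:, [0, 5]]   # keep the constant and YearsExperience only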

 

X_opt after final elimination

 

Execute the above line, and X_opt becomes the two-column matrix shown above.

Note that we now have only two columns left. If you look closely at the p-values for const and x2 in the p-value table above, they are 0.000, which is below the significance level of 0.05. This means they are significant and have an impact on salary predictions. So our backward elimination for multiple linear regression now stops. We will predict salaries based on the current state of X_opt.

Predict salaries for X_opt
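A sketch of this step; the variable names (X_opt_train, y_opt_pred, and so on) follow the ones mentioned in the comments below and are otherwise assumptions. Using the same random_state keeps the same test rows as before, so the predictions stay comparable.

# Split the reduced matrix and fit a fresh LinearRegression on it.
X_opt_train, X_opt_test, y_opt_train, y_opt_test = train_test_split(X_opt, y, test_size=1/3, random_state=0)
regressor_opt = LinearRegression()
regressor_opt.fit(X_opt_train, y_opt_train)
y_opt_pred = regressor_opt.predict(X_opt_test)   # salaries predicted from X_opt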

 

After these predictions, let us compare the salaries predicted using simple linear regression, multiple linear regression on the original X, multiple linear regression on X_opt, and the real salaries from y_test.

predictions comparison

Note that the predictions obtained from backward elimination for multiple linear regression are much closer to the actual results and, in our case, the same as those from simple linear regression.

I hope this article helped you understand backward elimination for multiple linear regression.


13 thoughts on "Backward Elimination for multiple linear regression"


  • May 21, 2018: Where can I get the formula that has been used in this context to calculate the p-values of the individual columns?

    • May 30, 2018: Hi Anwesh, thank you for your comment. You can find the code and formula for the p-value calculation in the statsmodels library for Python, which is used in the given program.
  • May 23, 2018: How do I plot y_opt_pred on a graph? I am trying

    plt.scatter(X_opt_train, y_opt_train, color = "red")
    plt.show()

    and it shows an error (x and y should be the same size).
  • July 18, 2018: Good article. The "cross_validation" module is deprecated and you should use "model_selection" instead to split the train and test data:

    # Splitting the dataset into the Training set and Test set
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

    • July 24, 2018: Hi n mohammad, thank you for your suggestion. I will make the necessary changes in the article soon. Keep providing constructive feedback.
  • July 19, 2018: Is there any other method to eliminate the columns automatically, for example with a for loop, since here we are doing it manually?
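    A possible sketch of such an automated loop (not from the original article; it assumes X already includes the leading column of ones and statsmodels is imported as sm):

    def backward_elimination(X, y, significance_level=0.05):
        # Repeatedly drop the column with the highest p-value until every
        # remaining p-value is at or below the significance level.
        columns = list(range(X.shape[1]))
        while True:
            regressor = sm.OLS(endog=y, exog=X[:, columns]).fit()
            p_values = regressor.pvalues
            worst = p_values.argmax()
            if p_values[worst] <= significance_level:
                return X[:, columns], columns
            del columns[worst]

    X_opt_auto, kept_columns = backward_elimination(X, y)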
  • August 13, 2018: Hi Prasad, could you please explain why we fitted our data to an OLS model instead of the LinearRegression model we have been using, in the line below?

    regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()

    We should be fitting and subsequently refitting the data to the model we have been using, right?
  • August 20, 2018: In the line

    X = np.append(arr = np.ones([30, 1]).astype(int), values = X, axis = 1)

    how do I decide the size of the ones array (e.g. 30, 1)? Does it depend on the dataset?
