Principal Component Analysis using Python


Hello all, so far we have learned about many machine learning models. Now we will learn about some feature extraction techniques. In this article, we will learn principal component analysis using python.

Principal Component Analysis Using Python:

Suppose there are m independent variables in your dataset. Performing principal component analysis using python creates m or fewer new independent variables, ordered from most to least explained variance in the existing independent variables. Principal component analysis is an unsupervised learning technique because it does not consider the dependent variable.

We usually use 2-dimensional graphs to visualize machine learning models, so we will perform principal component analysis using python on our dataset and reduce its dimensionality to 2.

Objective:

Please download the PCA dataset from the superdatascience website. You will find the wines.csv file in the folder. It contains 14 columns. The first 13 columns, i.e. the independent variables, are the contents of the wine. The last column is the customer segment, which takes 3 values, so this is a classification problem. Our objective is to reduce the 13 independent variables to 2 so that we can visualize the results on a graph.

Data Preprocessing:

We have learned a lot of machine learning models till now, hence the following preprocessing steps, sketched in code after the list below, should be easy to understand.

  • We imported the dataset and created matrices of independent and dependent variables.
  • We split the dataset into a training set and a test set.
  • We performed feature scaling.
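A minimal sketch of that preprocessing, assuming the standard scikit-learn workflow; the test_size and random_state values are assumptions, with test_size chosen so that the test set has the 18 rows counted in the confusion matrix later.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Import the dataset and build matrices of independent and dependent variables
dataset = pd.read_csv('wines.csv')
X = dataset.iloc[:, 0:13].values   # first 13 columns: contents of wine
y = dataset.iloc[:, 13].values     # last column: customer segment

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# Feature scaling: standardize the independent variables
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```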

Apply Principal Component Analysis:

Now, we will extract 2 new independent variables: the ones with the most and the second most explained variance.
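A minimal sketch of this step, assuming scikit-learn's PCA; printing explained_variance_ratio_ is just an optional check on how much variance the two components capture.

```python
from sklearn.decomposition import PCA

# Extract the 2 principal components with the highest explained variance
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

# Proportion of variance explained by each principal component
explained_variance = pca.explained_variance_ratio_
print(explained_variance)
```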

Note that we created an object of PCA and passed n_components = 2. It selects the 2 new independent variables with the most and second most explained variance. These are called principal components. X_train and X_test now become matrices with 2 columns.

Perform Classification:

We simply perform a logistic regression classification with the 2 principal components as independent variables and predict results for X_test, as sketched below. Executing this code gives us the confusion matrix.
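A minimal sketch of the classification step, assuming scikit-learn's LogisticRegression and confusion_matrix; random_state=0 is an assumption for reproducibility.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Fit logistic regression on the 2 principal components
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predict the customer segment for the test set
y_pred = classifier.predict(X_test)

# Confusion matrix: rows are actual segments, columns are predicted segments
cm = confusion_matrix(y_test, y_pred)
print(cm)
```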

Confusion matrix for PCA

We can see that the diagonal entries, i.e. 6 + 6 + 5 = 17 predictions, are correct, and 1 prediction is incorrect. So the accuracy of the classification model after principal component analysis using python is 17/18 * 100 = 94.44%.

Visualize Results:

We can plot a graph with the two principal components obtained above; one way to do this is sketched below.
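A sketch of such a decision-region plot for the test set, assuming matplotlib and the usual meshgrid/contourf approach; the grid step, colours and the PC1/PC2 axis labels are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Colour each point of a grid over the 2 principal components
# by the customer segment the classifier predicts for it
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(
    np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, 0.01),
    np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, 0.01))
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green', 'blue')))

# Plot the actual test-set customers on top of the decision regions
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color=ListedColormap(('red', 'green', 'blue'))(i), label=str(j))
plt.title('Logistic Regression with PCA (Test set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()
```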

This creates a nice graph with 3 segments of customers and plots existing customers in it.

Customer segment classification

Just like the confusion matrix, the graph also shows only one incorrect prediction. I hope this article helped you understand principal component analysis using python.
