# Principal Component Analysis using Python

By Prasad Kharkar, April 30, 2018

Hello all. So far, we have learned about many machine learning models. Now we will learn about some feature extraction techniques. In this article, we will learn principal component analysis using Python.

# Principal Component Analysis Using Python:

Suppose there are **m** independent variables in your dataset. Principal component analysis creates **m** or fewer new independent variables, ordered from the most to the least variance they explain in the existing independent variables. It comes under unsupervised learning because it does not consider the dependent variable.
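To make the "ordered by explained variance" idea concrete, here is a minimal NumPy sketch of what happens under the hood. It assumes a small synthetic dataset (not the wine data) and uses an eigen-decomposition of the covariance matrix, which is one common way to compute principal components; library implementations may use a different numerical route.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 rows, m = 4 features
# whose spreads differ a lot from column to column.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.1])

# Centre each column, then eigen-decompose the covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so reverse them to get
# components sorted from most to least explained variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance explained by each new variable.
explained_variance_ratio = eigenvalues / eigenvalues.sum()

# Keep the first k <= m components as the new independent variables.
k = 2
X_reduced = X_centered @ eigenvectors[:, :k]

print(explained_variance_ratio)
print(X_reduced.shape)
```

The printed ratios are in descending order and sum to 1 when all **m** components are kept; keeping only the first `k` discards the components that explain the least variance.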

We usually use 2-dimensional graphs to visualize machine learning models, so we will perform principal component analysis using Python on our dataset and reduce its dimensions to 2.

## Objective:

Please download the **PCA** dataset from the superdatascience website. You will find the **Wine.csv** file in the folder. It contains 14 columns. The first 13 columns, i.e. the independent variables, describe the contents of each wine. The last column is the customer segment, which takes 3 values, so this is a classification problem. Our objective is to reduce the 13 independent variables to 2 so that we can visualize the results on a graph.

## Data Preprocessing:

We have already learned a lot of machine learning models, so the following code should be easy to understand.

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Wine.csv')
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```

- We imported the dataset and created matrices of independent and dependent variables.
- We split the dataset into a training set and a test set.
- We performed feature scaling, fitting the scaler on the training set and reusing it on the test set.
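The scaling step matters for PCA because variables on larger scales would otherwise dominate the variance. The sketch below uses made-up numbers (an assumption, not the wine data) to show what `fit_transform` and `transform` do: the mean and standard deviation are learned from the training rows only and then reused on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data (assumption): 3 features on very different scales.
rng = np.random.default_rng(0)
train = rng.normal(loc=[10.0, 200.0, 0.5], scale=[2.0, 50.0, 0.1], size=(100, 3))
test = rng.normal(loc=[10.0, 200.0, 0.5], scale=[2.0, 50.0, 0.1], size=(20, 3))

sc_demo = StandardScaler()
train_scaled = sc_demo.fit_transform(train)  # learns mean and std from the training rows
test_scaled = sc_demo.transform(test)        # reuses the training mean and std

# Training columns end up with mean ~0 and standard deviation ~1.
print(train_scaled.mean(axis=0).round(6))
print(train_scaled.std(axis=0).round(6))

# Test columns are close to, but not exactly, mean 0 because they were
# scaled with the training statistics.
print(test_scaled.mean(axis=0).round(2))
```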

## Principal Component Analysis Using Python:

Now we will extract the 2 new independent variables with the most and the second most explained variance.

```python
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
```

Note that we created a PCA object and passed **n_components = 2**. It keeps the 2 new independent variables with the most and the second most explained variance. These are called principal components. X_train and X_test now become matrices with 2 columns.
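If you want to check how much variance the two components actually retain, the fitted PCA object exposes `explained_variance_ratio_`. The sketch below is a stand-alone illustration: it assumes scikit-learn's bundled `load_wine` data as a stand-in for Wine.csv (the two may not be identical), and the `_demo` names are only there to avoid clashing with the variables in this article.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for Wine.csv.
X_demo, y_demo = load_wine(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=2)
X_demo_2d = pca_demo.fit_transform(X_demo)

# The reduced matrix has one row per wine and two columns (PC1, PC2).
print(X_demo_2d.shape)

# Share of the total variance explained by PC1 and PC2, in that order.
print(pca_demo.explained_variance_ratio_)

# Total share of variance kept after reducing to 2 dimensions.
print(pca_demo.explained_variance_ratio_.sum())
```

A sum well below 1 means some information was discarded, which is the price paid for being able to plot the data in two dimensions.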

## Perform Classification:

```python
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
```

We simply fit a logistic regression classifier on the 2 new independent variables and predict results for X_test. Execute the code above and inspect `cm` to see the confusion matrix.

The diagonal entries, 6 + 6 + 5 = 17, are the correct predictions, and 1 prediction is incorrect. So the accuracy of the classification model after principal component analysis is 17/18 * 100 = 94.44%.
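The same arithmetic can be done in code: correct predictions are the sum of the diagonal, and accuracy is that sum divided by the total count. The matrix below is hypothetical. Its diagonal (6, 6, 5) matches the numbers above, but the position of the single misclassification is an assumption made only for illustration.

```python
import numpy as np

# Hypothetical confusion matrix: the diagonal matches the article,
# the location of the one off-diagonal error is assumed.
cm_example = np.array([[6, 0, 0],
                       [1, 6, 0],
                       [0, 0, 5]])

correct = np.trace(cm_example)   # sum of the diagonal entries
total = cm_example.sum()         # all test-set predictions
accuracy = correct / total

print(f"{correct}/{total} = {accuracy:.2%}")  # 17/18 = 94.44%
```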

## Visualize Results:

We can plot a graph with two principal components obtained above.

```python
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test

# Build a fine grid that covers the range of both principal components
aranged_pc1 = np.arange(start = X_set[:, 0].min(), stop = X_set[:, 0].max(), step = 0.01)
aranged_pc2 = np.arange(start = X_set[:, 1].min(), stop = X_set[:, 1].max(), step = 0.01)
X1, X2 = np.meshgrid(aranged_pc1, aranged_pc2)

# Predict a segment for every grid point and shade the resulting regions
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.5, cmap = ListedColormap(('orange', 'blue', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

# Plot the test-set observations, one colour per actual segment
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('Principal Component Analysis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()
```

This creates a graph with 3 shaded regions, one per predicted customer segment, and plots the test set observations on top of them.
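The grid construction in the plotting code can be hard to picture, so here is a tiny stand-alone sketch of it. The ranges and step are made up and kept deliberately coarse, and the `labels_demo` rule is only a placeholder for `classifier.predict`.

```python
import numpy as np

# Made-up, coarse ranges (assumption) standing in for the PC1 and PC2 ranges.
pc1_values = np.arange(start=-1.0, stop=1.0, step=0.5)   # 4 values
pc2_values = np.arange(start=0.0, stop=1.5, step=0.5)    # 3 values

# meshgrid pairs every PC1 value with every PC2 value.
X1_demo, X2_demo = np.meshgrid(pc1_values, pc2_values)

# Flatten both grids and stack them so each row is one (PC1, PC2) point,
# which is the input shape a classifier's predict method expects.
grid_points = np.array([X1_demo.ravel(), X2_demo.ravel()]).T

# Placeholder for classifier.predict: label points by the sign of PC1,
# then reshape back to the grid so contourf can shade regions.
labels_demo = (grid_points[:, 0] > 0).astype(int).reshape(X1_demo.shape)

print(X1_demo.shape)       # (3, 4)
print(grid_points.shape)   # (12, 2)
print(labels_demo.shape)   # (3, 4)
```

With a step of 0.01, as in the plotting code, the grid is fine enough that the shaded regions look like smooth areas rather than individual cells.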

Just like the confusion matrix, the graph shows only one incorrect prediction. I hope this article helped you understand principal component analysis using Python.