
Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an unsupervised feature-extraction technique designed to reduce the dimensionality of a dataset while preserving as much of its variance as possible. Essentially, it rotates the dataset so that the resulting features are statistically uncorrelated, typically followed by selecting a subset of the new features according to how much of the data they explain.

Understanding PCA

Principal Component Analysis (PCA), rooted in linear algebra, maximizes the variance captured while removing redundant correlations within a dataset by constructing new variables as weighted linear combinations of the original ones.

PCA creates the new variables by transforming the original (mean-centered) observations (records) in a dataset to a new set of variables (dimensions) using the eigenvectors and eigenvalues calculated from a covariance matrix of your original variables. Step-by-step:

  1. Centering the values of all of the input variables
  2. Scaling the data if needed, depending on the units of the variables
  3. Calculating the covariance matrix of the data
  4. Calculating the eigenvectors and eigenvalues of the covariance matrix
  5. Sorting the principal components (eigenvectors) by descending eigenvalue

1. Mean Centering

The first step of PCA is centering the values of all of the input variables (i.e., subtracting the mean of each variable from its values), making the mean of each variable equal to zero. Centering is an important pre-processing step because it ensures that the resulting components only capture the variance within the dataset, rather than treating the overall mean of the dataset as an important dimension. Without mean-centering, the first principal component found by PCA might correspond to the mean of the data instead of the direction of maximum variance.
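
As a minimal sketch of this step (using NumPy and a small made-up data matrix, not any dataset from this article), centering simply subtracts each column's mean:

import numpy as np
 
# Hypothetical data: 5 observations of 3 variables
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.8],
              [1.9, 2.2, 1.1],
              [3.1, 3.0, 0.3]])
 
# Subtract each variable's (column) mean so every column ends up with mean zero
X_centered = X - X.mean(axis=0)
print(X_centered.mean(axis=0))  # approximately [0. 0. 0.]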

2. Scaling

Dimensionality reduction using PCA consists of finding the features that maximize the variance. If one feature varies more than the others only because of their respective scales, PCA will determine that this feature dominates the direction of the principal components. We can inspect the first principal component, with and without scaling, using all the original features:

# Assumed setup (not shown in the original snippet): the wine dataset split into
# train/test sets plus a standardized copy, as in scikit-learn's feature-scaling example
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
 
X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaled_X_train = StandardScaler().fit_transform(X_train)
 
pca = PCA(n_components=2).fit(X_train)
scaled_pca = PCA(n_components=2).fit(scaled_X_train)
X_train_transformed = pca.transform(X_train)
X_train_std_transformed = scaled_pca.transform(scaled_X_train)
 
first_pca_component = pd.DataFrame(
    pca.components_[0], index=X.columns, columns=["without scaling"]
)
first_pca_component["with scaling"] = scaled_pca.components_[0]
first_pca_component.plot.bar(
    title="Weights of the first principal component", figsize=(6, 8)
)
 
_ = plt.tight_layout()

[Figure: weights of the first principal component, with and without scaling]

Indeed, we find that without scaling the "proline" feature dominates the direction of the first principal component, with a weight about two orders of magnitude above the other features. In contrast, in the first principal component of the scaled data the weights are of roughly the same order of magnitude across all the features.

We can visualize the distribution of the principal components in both cases:

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
 
target_classes = range(0, 3)
colors = ("blue", "red", "green")
markers = ("^", "s", "o")
 
for target_class, color, marker in zip(target_classes, colors, markers):
    ax1.scatter(
        x=X_train_transformed[y_train == target_class, 0],
        y=X_train_transformed[y_train == target_class, 1],
        color=color,
        label=f"class {target_class}",
        alpha=0.5,
        marker=marker,
    )
 
    ax2.scatter(
        x=X_train_std_transformed[y_train == target_class, 0],
        y=X_train_std_transformed[y_train == target_class, 1],
        color=color,
        label=f"class {target_class}",
        alpha=0.5,
        marker=marker,
    )
 
ax1.set_title("Unscaled training dataset after PCA")
ax2.set_title("Standardized training dataset after PCA")
 
for ax in (ax1, ax2):
    ax.set_xlabel("1st principal component")
    ax.set_ylabel("2nd principal component")
    ax.legend(loc="upper right")
    ax.grid()
 
_ = plt.tight_layout()

[Figure: unscaled vs. standardized training dataset after PCA, plotted on the first two principal components]

From the plot above we observe that scaling the features before reducing the dimensionality results in components with the same order of magnitude. In this case it also improves the separability of the classes, and better separability generally carries over to better overall model performance.

3. The Covariance Matrix

Once the data has been centered (and possibly scaled, depending on the units of the variables), the covariance matrix of the data needs to be calculated. Covariance here is a measure of how two variables vary together linearly, and it is computed for every pair of variables. The resulting matrix is square and symmetric: the variances sit along its diagonal, and each covariance is mirrored across it, giving a comprehensive two-dimensional overview of how the variables linearly relate.
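
A quick sketch of this step with NumPy (again on made-up data, purely for illustration):

import numpy as np
 
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # hypothetical data: 100 observations, 3 variables
X_centered = X - X.mean(axis=0)      # centering step from above
 
# rowvar=False tells NumPy that columns are the variables, rows are the observations
cov_matrix = np.cov(X_centered, rowvar=False)
 
print(cov_matrix.shape)                          # (3, 3): square
print(np.allclose(cov_matrix, cov_matrix.T))     # True: symmetric
print(np.diag(cov_matrix))                       # variances sit on the diagonal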

This visualization from stats.stackexchange is super helpful for understanding covariance.

4. Eigenvectors and Eigenvalues

Because covariance matrices are square and symmetric, they are diagonalizable, which means an eigendecomposition can be calculated for the matrix. This is where PCA finds the eigenvectors and eigenvalues of the data set. An eigenvector of a linear transformation is a (non-zero) vector that is only scaled (changed by a scalar multiple of itself) when that linear transformation is applied to it. The eigenvalue is the scalar associated with the eigenvector.

In the context of understanding PCA at a high level, all you really need to know about eigenvectors and eigenvalues is that the eigenvectors of the covariance matrix are the axes of the principal components in a dataset. The eigenvectors define the directions of the principal components calculated by PCA. The eigenvalue associated with each eigenvector describes how much variance lies along that axis, i.e., how far spread apart the observations (points) are along the new axis.

[Figure: eigenvectors]
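
As a small illustrative sketch (again on made-up data), NumPy's np.linalg.eigh handles this step; it is designed for symmetric matrices such as the covariance matrix:

import numpy as np
 
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_centered = X - X.mean(axis=0)
cov_matrix = np.cov(X_centered, rowvar=False)
 
# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
 
# Re-order so the direction of greatest variance (largest eigenvalue) comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
 
print(eigenvalues)            # variance captured along each principal axis
print(eigenvectors[:, 0])     # direction of the first principal component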

The first eigenvector will point along the greatest variance (separation between points) found in the dataset, and each subsequent eigenvector will be perpendicular (or in math-speak, orthogonal) to the ones calculated before it. This is how we know that each of the principal components will be uncorrelated with the others. Each eigenvector found by PCA picks up a combination of variance from the original variables in the data set.
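
One way to see this uncorrelatedness directly is the following sketch, which uses scikit-learn's PCA on the Iris data (the same dataset used in Example 1 below) and checks that the covariance matrix of the transformed data is, up to floating-point noise, diagonal:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
 
X = load_iris().data
X_pca = PCA().fit_transform(X)
 
# Off-diagonal covariances between the new components are ~0: they are uncorrelated
cov_of_components = np.cov(X_pca, rowvar=False)
print(np.round(cov_of_components, 6))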

In this visual, Principal component 1 accounts for variance from both variables A and B:

[Figure: PCA variance, with principal component 1 capturing variance from both variables A and B]

5. Principal Components

The eigenvalues are important because they provide a ranking criterion for the newly derived variables (axes). The principal components (eigenvectors) are sorted by descending eigenvalue. The principal components with the highest eigenvalues are "picked first" as principal components because they account for the most variance in the data.

Visualization tools such as scree plots can assist in determining the number of principal components to retain: they display the variance captured by each component and help identify the point where additional components contribute little further explained variance.

[Figure: scree plot]
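
A scree-style plot is easy to produce from scikit-learn's explained_variance_ratio_; the sketch below uses the Iris data purely as an illustration:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
 
pca = PCA().fit(load_iris().data)
ratios = pca.explained_variance_ratio_
 
# Per-component and cumulative explained variance against the component index
plt.plot(range(1, len(ratios) + 1), ratios, marker="o", label="per component")
plt.plot(range(1, len(ratios) + 1), np.cumsum(ratios), marker="s", label="cumulative")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.legend()
plt.show()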

Following the identification and selection of principal components, the original data observations are converted to these components through the creation of a projection matrix. This projection matrix is just the selected eigenvectors concatenated into a matrix, one per column. We can then multiply the matrix of our original observations and variables by our projection matrix. The output of this process is a transformed data set, projected into our new data space made up of our principal components!

[Figure: PCA projection]
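
Concretely, the projection is just a matrix multiplication. The sketch below (on made-up data, keeping the top two eigenvectors) builds the projection matrix and applies it:

import numpy as np
 
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # hypothetical data: 100 observations, 5 variables
X_centered = X - X.mean(axis=0)
 
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
 
# Projection matrix: the top-k eigenvectors stacked as columns (here k = 2)
W = eigenvectors[:, order[:2]]                # shape (5, 2)
 
# Multiply the centered observations by the projection matrix
X_projected = X_centered @ W                  # shape (100, 2): data in the new space
print(X_projected.shape)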

Assumptions & Limitations

There are a few things to consider before applying PCA.

Normalizing the data prior to performing PCA can be important, particularly when the variables have different units or scales.

PCA assumes that the data can be approximated by a linear structure and that the data can be described with fewer features. It assumes that a linear transformation can and will capture the most important aspects of the data. It also assumes that high variance in the data means that there is a high signal-to-noise ratio.

Dimensionality reduction does result in a loss of some information: by not keeping all the eigenvectors, some of the variance in the data is discarded. However, if the eigenvalues of the eigenvectors that are not included are small, you are not losing much information.

Another consideration to make with PCA is that the variables become less interpretable after being transformed. An input variable might mean something specific like "UV light exposure," but the variables created by PCA are a mishmash of the original data and can’t be interpreted in a clear way like "an increase in UV exposure is correlated with an increase in skin cancer presence." Less interpretable also means less explainable, when you're pitching your models to others.

Strengths

PCA is popular because it can effectively find an optimal representation of a data set with fewer dimensions. It is effective at filtering noise and decreasing redundancy. If you have a data set with many continuous variables, and you aren’t sure how to go about selecting important features for your target variable, PCA might be perfect for your application. In a similar vein, PCA is also popular for visualizing data sets with high-dimensionality (because it's hard for us meager humans to think in more than three dimensions).

Example 1: PCA with the Iris Dataset

Overview of the Iris Dataset and Initial Implementation

The Iris dataset, included with scikit-learn, provides four-dimensional feature data derived from two physical parts of the Iris flower, the petal and the sepal, each measured by length and width. The dataset contains 150 samples and covers three species of Iris: Setosa, Versicolor, and Virginica. Principal Component Analysis (PCA) facilitates the visualization of this multi-dimensional data in a two-dimensional space while attempting to preserve its variance. Utilizing PCA through scikit-learn is succinct, as shown in the following implementation.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import pylab as pl
from itertools import cycle
 
iris = load_iris()
numSamples, numFeatures = iris.data.shape
 
X = iris.data
pca = PCA(n_components=2, whiten=True).fit(X)
X_pca = pca.transform(X)

Insight into Principal Components and Variance

When PCA distills the 4D Iris dataset down to 2D, it selects two orthogonal 4D vectors, which form the basis of the new 2D projection. A glimpse at these eigenvectors, or principal components, is possible through:

print(pca.components_)

Displayed as:

[[ 0.36158968 -0.08226889  0.85657211  0.35884393]
 [ 0.65653988  0.72971237 -0.1757674  -0.07470647]]

Moreover, understanding the preserved variance when reducing dimensions is essential. PCA provides an explained_variance_ratio_ attribute detailing the proportion of variance each principal component retains. The two principal components used here hold approximately 92% and 5% of the data's variance, summing to nearly 98%, which implies a notable retention of information despite the reduction from four dimensions to two.

print(pca.explained_variance_ratio_)
print(sum(pca.explained_variance_ratio_))

Resulting in:

[ 0.92461621  0.05301557]
0.977631775025

Visual Representation of the 2D Projection

Visualizing the two principal components can be achieved via a scatter plot. The implementation below iteratively plots each Iris species using a distinct color, offering a visual representation of the clusters formed in the new two-dimensional space.

%matplotlib inline
from pylab import *
 
colors = cycle('rgb')
target_ids = range(len(iris.target_names))
pl.figure()
for i, c, label in zip(target_ids, colors, iris.target_names):
    pl.scatter(X_pca[iris.target == i, 0], X_pca[iris.target == i, 1],
        c=c, label=label)
pl.legend()
pl.show()

The following is what we end up with:

That is our 4D Iris data projected down to 2 dimensions. The visualization shows clusters of the Iris species, with only minor intermingling, confirming PCA's effectiveness at reducing dimensionality while preserving variance. It's hard to say what the actual component values represent, but the important point is that we've projected 4D data down to 2D in a way that still preserves the variance.

Example 2: Image Segmentation Data

Using the applypca1 helper defined in the intro:

applypca1(Xtrain, Xtest, Ytrain, Ytest, columns=COLUMNS)

[Figure: cumulative explained variance as a function of the number of components]

This curve quantifies how much of the total 19-dimensional variance is contained within the first N components. For example, we see that with the segmentation dataset the first 5 components contain approximately 75% of the variance, while around 12 components are needed to describe close to 100% of the variance.

Here we see that our two-dimensional projection would lose a lot of information (as measured by the explained variance), and that we'd need about 12 components to retain around 90% or more of the variance. Looking at this plot for a high-dimensional dataset helps you understand the level of redundancy present across its features.
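
As a rough sketch of how such a threshold can be read off programmatically (the synthetic 19-feature matrix below is only a stand-in, since the segmentation data itself is loaded elsewhere in this series):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
 
# Stand-in data: a synthetic 19-feature matrix, not the real segmentation dataset
X_demo, _ = make_classification(n_samples=500, n_features=19, n_informative=10, random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)
 
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
 
# Smallest number of components whose cumulative explained variance reaches 90%
print(int(np.argmax(cumulative >= 0.90)) + 1)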

Hence, we will keep n_components = 12, extract the principal components, and feed them to the existing Random Forest model to evaluate performance:

applypca2(Xtrain, Xtest, Ytrain, Ytest, columns=COLUMNS)
Time Elapsed: 6.234650976000012 secs
Classification Report after applying Random Forest:
----------------------------------------------------
              precision    recall  f1-score   support

           0       0.97      1.00      0.98        30
           1       0.97      1.00      0.98        30
           2       0.97      0.97      0.97        30
           3       1.00      1.00      1.00        30
           4       1.00      1.00      1.00        30
           5       1.00      1.00      1.00        30
           6       1.00      0.93      0.97        30

    accuracy                           0.99       210
   macro avg       0.99      0.99      0.99       210
weighted avg       0.99      0.99      0.99       210

After applying PCA, performance remains just as strong even after reducing the number of features from 19 to 12. If we then run:

applypca3(Xtrain, Xtest, Ytrain, Ytest, columns=COLUMNS)

[Figure: weights of the 19 features in each principal component]

You can see that in the first few components, all features have roughly the same sign, which means there is a general correlation between the features: as one measurement is high, the others are likely to be high as well. The later components have mixed signs, and all of the components involve combinations of the 19 features from the Image Segmentation dataset.
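
A plot along these lines can be produced with matplotlib's matshow on pca.components_. This is only a sketch: it assumes a fitted pca object and the COLUMNS list of the 19 feature names, both of which are defined outside this excerpt:

import matplotlib.pyplot as plt
 
# pca: an already-fitted PCA(n_components=12); COLUMNS: the 19 feature names (assumed from the intro)
plt.matshow(pca.components_, cmap="viridis")
plt.yticks(range(pca.components_.shape[0]),
           [f"PC{i + 1}" for i in range(pca.components_.shape[0])])
plt.xticks(range(len(COLUMNS)), COLUMNS, rotation=60, ha="left")
plt.colorbar()
plt.xlabel("Feature")
plt.ylabel("Principal component")
plt.show()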

Example 3: PCA Application on Diabetes Data

Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form. Generally this is called a data reduction technique. A useful property of PCA is that you can choose the number of dimensions, or principal components, in the transformed result. In the example below, we use PCA and select 3 principal components.

# Feature Extraction with PCA
import numpy
from pandas import read_csv
from sklearn.decomposition import PCA
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
# Percentage of variance explained by each of the selected components.
numpy.set_printoptions(precision=3)  # print with 3 decimals, matching the output below
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
Explained Variance: [0.889 0.062 0.026]
[[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
   5.372e-04 -3.565e-03]
 [-2.265e-02 -9.722e-01 -1.419e-01  5.786e-02  9.463e-02 -4.697e-02
  -8.168e-04 -1.402e-01]
 [-2.246e-02  1.434e-01 -9.225e-01 -3.070e-01  2.098e-02 -1.324e-01
  -6.400e-04 -1.255e-01]]

You can see that the transformed dataset (3 principal components) bears little resemblance to the source data.
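
To actually obtain that compressed 3-column dataset, one more call is needed; the snippet below continues directly from the fitted pca object above:

# Project the original 8-feature observations onto the 3 principal components
X_transformed = pca.transform(X)
print(X_transformed.shape)    # (768, 3)
print(X_transformed[:3, :])   # first few rows in the new, compressed space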

Conclusion

In sum, PCA represents a strategically vital technique in the data scientist’s toolkit, providing a mechanism through which high-dimensional data can be rendered into a more manageable form without substantial loss of variance, ensuring that the resultant data retains maximal explanatory power.

