Deciding on how many principal components to retain


Introduction

Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.

Key Concepts and Principles

Principal components analysis (PCA)

PCA brings out the strong patterns in a dataset by suppressing variation that is mostly noise. It is especially useful when variables are strongly correlated, because correlated variables carry redundant information. PCA therefore summarizes a large set of variables with a smaller set of components that retain most of the information.

Eigenvalues and eigenvectors

In PCA, the principal components are obtained from the eigenvalues and eigenvectors of the covariance (or correlation) matrix of the data. Each eigenvalue is the variance explained by its principal component, while the corresponding eigenvector holds the weights (loadings) of the original variables in that component.
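
As a minimal sketch of how these quantities can be computed, the following uses NumPy on a small synthetic dataset. The data, the random seed, and the variable names are assumptions made for illustration only; they do not come from a particular analysis.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: 200 observations of 3 variables, two of them strongly correlated
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=200), rng.normal(size=200)])

# Covariance matrix of the variables (rowvar=False: columns are variables)
cov = np.cov(X, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so that the first component has the largest variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # column i holds the weights of component i

print(eigenvalues)
```

Because two of the three variables are nearly redundant, one eigenvalue should come out close to zero, which is the situation in which dropping a component costs very little information.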

Explained variance

Explained variance refers to the amount of variance in the data that is captured by each principal component. The first principal component captures the most variance, the second captures the next highest amount of variance, and so on.

Scree plot

A scree plot is a line plot of the eigenvalues of factors or principal components in an analysis. The scree plot is used to determine the number of factors to retain in an exploratory factor analysis (EFA) or principal components to keep in a PCA.

Cumulative explained variance

Cumulative explained variance is the total amount of variance explained by the first n components. This is used to decide how many principal components are sufficient to represent the data.
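
The two quantities can be worked out by hand from the eigenvalues. The sketch below assumes a hypothetical list of five eigenvalues, already sorted in descending order; the numbers are chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical eigenvalues, already sorted in descending order
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]

total = sum(eigenvalues)

# Share of the total variance explained by each component
explained_ratio = [ev / total for ev in eigenvalues]

# Running total of those shares for the first n components
cumulative = []
running = 0.0
for ev in eigenvalues:
    running += ev
    cumulative.append(running / total)

print(explained_ratio)  # [0.5, 0.25, 0.125, 0.0625, 0.0625]
print(cumulative)       # [0.5, 0.75, 0.875, 0.9375, 1.0]
```

Here the first two components together explain 75% of the variance, and the first four explain 93.75%.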

Step-by-Step Walkthrough of Typical Problems and Solutions

Determining the number of principal components to retain using the scree plot method

  1. Plot the eigenvalues in descending order against the corresponding principal components.
  2. Identify the 'elbow' point in the scree plot, where the plot starts flattening out.
  3. Retain the components that come before the elbow point.
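
The elbow is normally found by eye, but the steps above can be roughly automated. The sketch below uses one possible heuristic, which is an assumption and not part of the scree plot method itself: it picks the point with the largest second difference, where a steep drop is followed by a much flatter one. The function name elbow_index and the eigenvalues are hypothetical.

```python
def elbow_index(eigenvalues):
    """Return the 0-based index of the elbow in a descending list of eigenvalues.

    Heuristic (an assumption): the elbow is the interior point with the
    largest second difference, i.e. where the curve bends most sharply
    from steep to flat.
    """
    if len(eigenvalues) < 3:
        raise ValueError("need at least three eigenvalues to locate an elbow")
    best_i, best_curv = 1, float("-inf")
    for i in range(1, len(eigenvalues) - 1):
        curv = eigenvalues[i - 1] - 2 * eigenvalues[i] + eigenvalues[i + 1]
        if curv > best_curv:
            best_i, best_curv = i, curv
    return best_i


# Hypothetical eigenvalues in descending order
eigenvalues = [4.0, 2.5, 0.6, 0.5, 0.4, 0.3]

elbow = elbow_index(eigenvalues)
n_retain = elbow  # components before the elbow have indices 0 .. elbow - 1

print(elbow + 1, n_retain)  # elbow at component 3, so 2 components are retained
```

On real data the curve is rarely this clean, so a result like this is best treated as a starting point to check against the plot.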

Determining the number of principal components to retain using the cumulative explained variance method

  1. Calculate the cumulative explained variance for each principal component.
  2. Plot the cumulative explained variance against the number of principal components.
  3. Retain the smallest number of components whose cumulative explained variance reaches the desired level (e.g., 95% of the total variance).
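
Apart from the plot, this procedure reduces to a short loop. The sketch below assumes the eigenvalues are already sorted in descending order; the function name n_components_for and the example values are hypothetical.

```python
def n_components_for(eigenvalues, threshold=0.95):
    """Smallest number of leading components whose cumulative explained
    variance reaches the threshold (a fraction between 0 and 1).

    Assumes the eigenvalues are sorted in descending order.
    """
    total = sum(eigenvalues)
    running = 0.0
    for n, ev in enumerate(eigenvalues, start=1):
        running += ev
        if running / total >= threshold:
            return n
    # Guard against rounding error leaving the final ratio just below 1.0
    return len(eigenvalues)


# Hypothetical eigenvalues in descending order
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]

print(n_components_for(eigenvalues, 0.95))  # 5
print(n_components_for(eigenvalues, 0.90))  # 4
print(n_components_for(eigenvalues, 0.75))  # 2
```

The example shows how sensitive the answer is to the chosen level: lowering the target from 95% to 90% drops one component, and 75% needs only two.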

Real-World Applications and Examples

Dimensionality reduction in data analysis

PCA is often used in data analysis for dimensionality reduction. By retaining only the most important principal components, the number of variables in a dataset can be significantly reduced, making the data easier to work with and the results easier to interpret.

Image compression and reconstruction

PCA can also be used for image compression and reconstruction. Instead of storing every pixel, only the leading principal components and the image's coordinates along them are kept, and an approximate image is reconstructed from these. This can significantly reduce storage space and transmission time.
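
A minimal sketch of compression and reconstruction, assuming NumPy and a synthetic 64 x 64 matrix standing in for a grayscale image. The function name pca_reconstruct, the matrix, and the values of k are assumptions for illustration; the rows of the matrix are treated as observations and the columns as variables.

```python
import numpy as np


def pca_reconstruct(X, k):
    """Project the rows of X onto the first k principal components and map back."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]              # shape (k, n_columns)
    scores = Xc @ components.T       # shape (n_rows, k): the compressed form
    return scores @ components + mean


rng = np.random.default_rng(0)
# Synthetic "image": rank-3 structure plus a small amount of noise
image = rng.normal(size=(64, 3)) @ rng.normal(size=(3, 64))
image = image + 0.01 * rng.normal(size=(64, 64))

# Relative reconstruction error for different numbers of retained components
errors = {}
for k in (1, 3, 64):
    approx = pca_reconstruct(image, k)
    errors[k] = np.linalg.norm(image - approx) / np.linalg.norm(image)

print(errors)
```

With k components, the compressed form consists of the scores, the components, and the column means, so k * (64 + 64) + 64 values replace the original 4096. The error shrinks as k grows, which is the same trade-off that the scree plot and cumulative explained variance methods are meant to settle.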

Advantages and Disadvantages of Deciding on How Many Principal Components to Retain

Advantages

  1. Simplifies complex datasets by reducing the number of variables.
  2. Reduces dimensionality and improves interpretability.
  3. Preserves most of the information in the original dataset.

Disadvantages

  1. Loss of some information due to dimensionality reduction.
  2. Subjectivity in selecting the number of principal components to retain.

Summary

Deciding on how many principal components to retain is a crucial step in Principal Component Analysis (PCA). This decision is often based on the eigenvalues of the components, the explained variance, and the cumulative explained variance. Two common methods for deciding the number of components to retain are the scree plot method and the cumulative explained variance method. PCA has many real-world applications, including dimensionality reduction in data analysis and image compression and reconstruction. While PCA simplifies complex datasets and improves interpretability, it may also lead to loss of some information.

Analogy

Deciding on how many principal components to retain is like packing for a trip. You want to pack as few items as possible (to reduce the weight of your luggage), but you also want to make sure you have everything you need. Similarly, in PCA, you want to retain as few components as possible (to simplify your data), but you also want to make sure you capture as much of the variance in your data as possible.

Quizzes

What is the purpose of a scree plot in PCA?
  • To plot the eigenvalues in ascending order
  • To plot the explained variance against the number of components
  • To plot the cumulative explained variance against the number of components
  • To plot the eigenvalues in descending order

Possible Exam Questions

  • Explain the process of deciding on how many principal components to retain using the scree plot method.

  • Explain the process of deciding on how many principal components to retain using the cumulative explained variance method.

  • What are the advantages and disadvantages of deciding on how many principal components to retain?

  • What are some real-world applications of deciding on how many principal components to retain?

  • Explain the concept of eigenvalues and eigenvectors in the context of PCA.