Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

7. Principal Component Analysis (PCA)

Lesson 9 of 22 in the free Machine Learning II notes on Siksha Sarovar, written by Rohit Jangra.

PCA is one of the most widely used unsupervised dimensionality reduction techniques. Developed by Pearson (1901) and Hotelling (1933), PCA finds an orthogonal basis whose directions capture as much of the variance in the data as possible. These directions are obtained from an eigendecomposition of the covariance matrix.

Mathematical Foundation

Given a centered data matrix X (n x p: n samples, p features), PCA starts from the sample covariance matrix: C = (1/(n-1)) X^T X

The principal components are eigenvectors of C: C v_i = lambda_i v_i

Eigenvalue lambda_i = variance explained by component i.
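The two relations above can be checked numerically. The following is a minimal NumPy sketch on made-up toy data (the data and variable names are illustrative assumptions, not part of the notes). It confirms that C v = lambda v holds and that the variance of the data projected onto v equals lambda.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 samples, 3 correlated features
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.3, 0.2]])
X = X - X.mean(axis=0)            # center each column
n = X.shape[0]

C = X.T @ X / (n - 1)             # sample covariance matrix (p x p)

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order
eigvals, eigvecs = np.linalg.eigh(C)
v = eigvecs[:, -1]                # direction with the largest eigenvalue
lam = eigvals[-1]

eig_ok = np.allclose(C @ v, lam * v)     # checks C v = lambda v
proj_var = np.var(X @ v, ddof=1)         # variance of the data along v

print(eig_ok, proj_var, lam)
```

The printed variance and eigenvalue should agree up to floating-point rounding.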

Algorithm (Eigendecomposition)

  1. Center the data: subtract each feature's mean; optionally divide by its std dev (standardize)
  2. Compute covariance matrix C = X^T * X / (n-1)
  3. Solve eigendecomposition: C V = V Lambda
  4. Sort eigenvectors by descending eigenvalue
  5. Project: Z = X * V_k (keep top-k components)
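The five steps above can be sketched in NumPy as follows. The function name pca_eig, its scale flag and the demo data are hypothetical choices made for illustration; this is one straightforward way to write the algorithm, not a reference implementation.

```python
import numpy as np

def pca_eig(X, k, scale=False):
    """PCA via eigendecomposition of the covariance matrix (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    # Step 1: center, and optionally scale to unit variance
    # (assumes no feature is constant when scale=True)
    Xc = X - X.mean(axis=0)
    if scale:
        Xc = Xc / Xc.std(axis=0, ddof=1)
    n = Xc.shape[0]
    # Step 2: covariance matrix
    C = Xc.T @ Xc / (n - 1)
    # Step 3: eigendecomposition (eigh returns ascending eigenvalues)
    eigvals, eigvecs = np.linalg.eigh(C)
    # Step 4: sort by descending eigenvalue
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 5: project onto the top-k directions
    V_k = eigvecs[:, :k]
    Z = Xc @ V_k
    return Z, V_k, eigvals

# Made-up demo data: 100 samples, 4 features with different spreads
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
Z, V_k, eigvals = pca_eig(X_demo, k=2)
print(Z.shape)   # (100, 2)
```

A useful sanity check is that the columns of Z are uncorrelated and that their variances equal the top-k eigenvalues.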

Choosing k (Number of Components)

Two common criteria:

  • Explained variance threshold: Keep enough components for 95% cumulative variance
  • Elbow method: Find the 'knee' in the scree plot (rapid drop in eigenvalues)

Explained ratio = lambda_i / sum(lambda_j)

Cumulative ratio = sum_{j=1}^{k}(lambda_j) / sum(lambda_j)
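As a small worked example of the variance-threshold criterion, the sketch below computes both ratios and picks the smallest k whose cumulative ratio reaches the threshold. The helper name choose_k and the eigenvalues are made up for illustration.

```python
import numpy as np

def choose_k(eigvals, threshold=0.95):
    """Smallest k whose cumulative explained-variance ratio reaches the threshold."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending
    ratio = eigvals / eigvals.sum()          # explained ratio per component
    cumulative = np.cumsum(ratio)            # cumulative ratio for k = 1..p
    # first index where cumulative >= threshold, converted to a count
    k = int(np.searchsorted(cumulative, threshold)) + 1
    k = min(k, len(eigvals))                 # guard against rounding at threshold = 1.0
    return k, ratio, cumulative

# Hypothetical eigenvalues (they sum to 10, so the ratios are easy to read)
eigvals_demo = [4.0, 3.0, 2.0, 0.7, 0.3]
k, ratio, cumulative = choose_k(eigvals_demo)
print(k)   # 4: three components reach 90%, the fourth brings it to 97%
```

For the elbow method, the same ratio array can be plotted against the component index to look for the knee.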

PCA via SVD

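More numerically stable, because X^T X is never formed explicitly (forming it squares the condition number of X): decompose the centered X = U Sigma V^T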

  • V columns are principal components (directions)
  • Sigma diagonal entries are related to eigenvalues: lambda_i = sigma_i^2 / (n-1)
  • scikit-learn's PCA is built around SVD-based solvers, largely for this reason
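The link between the two routes can be verified directly. This is a minimal NumPy sketch on made-up data (all names are illustrative assumptions). It checks that sigma_i^2 / (n-1) matches the eigenvalues of the covariance matrix, and that the scores can be computed either as X V_k or as U_k Sigma_k.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 150 samples, 5 correlated features
X = rng.normal(size=(150, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)            # SVD-based PCA still needs centered data
n = Xc.shape[0]

# Thin SVD: singular values are returned in descending order
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals_svd = s**2 / (n - 1)       # lambda_i = sigma_i^2 / (n-1)

# Eigenvalues from the covariance route, flipped to descending order
C = Xc.T @ Xc / (n - 1)
eigvals_eig = np.linalg.eigh(C)[0][::-1]

k = 2
Z_svd = Xc @ Vt[:k].T              # project onto the top-k rows of V^T
Z_alt = U[:, :k] * s[:k]           # same scores, computed as U_k Sigma_k

print(np.allclose(eigvals_svd, eigvals_eig), np.allclose(Z_svd, Z_alt))
```

Individual eigenvectors from the two routes may differ by a sign flip, which is why the comparison here uses eigenvalues and scores from a single decomposition.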

Reconstruction Error

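The squared reconstruction error (Frobenius norm, summed over all n samples) after keeping k components: ||X - Z * V_k^T||_F^2 = (n-1) * sum_{i=k+1}^{p} lambda_i = sum_{i=k+1}^{p} sigma_i^2. Dividing by n-1 gives the sum of the discarded eigenvalues.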
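A quick numerical check of the reconstruction error, as a sketch on made-up data. When the error is measured as a squared Frobenius norm over all n rows, it equals (n-1) times the sum of the discarded eigenvalues. Dividing by n-1 gives the plain sum of discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical data: 80 samples, 6 features with decreasing spread
X = rng.normal(size=(80, 6)) * np.array([4.0, 3.0, 2.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)
n, p = Xc.shape

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals = s**2 / (n - 1)           # descending eigenvalues of the covariance matrix

k = 3
V_k = Vt[:k].T                     # top-k directions (p x k)
Z = Xc @ V_k                       # scores (n x k)
X_hat = Z @ V_k.T                  # reconstruction from k components

err = np.sum((Xc - X_hat) ** 2)            # squared Frobenius norm
expected = (n - 1) * eigvals[k:].sum()     # (n-1) * sum of discarded eigenvalues

print(np.isclose(err, expected))
```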

Common Pitfalls

  • Always standardize features before PCA if they are on different scales
  • PCA captures variance, not necessarily information useful for classification
  • Principal components are linear combinations of all original features, so they are harder to interpret than the features themselves
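The first pitfall is easy to demonstrate. In the sketch below, two independent made-up features (height in metres and income in rupees, both hypothetical) differ hugely in scale. Without standardization the first component is almost entirely the income axis. After standardization the variance is split roughly evenly.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
# Hypothetical, independent features on very different scales
height_m = rng.normal(1.7, 0.1, size=n)          # spread of about 0.1
income = rng.normal(50_000, 10_000, size=n)      # spread of about 10,000
X = np.column_stack([height_m, income])

def first_component_ratio(X):
    """Share of total variance captured by the first principal component."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc / (len(Xc) - 1))   # ascending order
    return eigvals[-1] / eigvals.sum()

raw_ratio = first_component_ratio(X)

# Standardize: zero mean and unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std_ratio = first_component_ratio(X_std)

print(raw_ratio, std_ratio)
```

On the raw data the ratio is essentially 1, which reflects the choice of units rather than any real structure. On the standardized data it is close to 0.5, as expected for two unrelated features.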

Exam-Ready Summary

  • PCA: unsupervised, finds orthogonal directions of maximum variance
  • Eigenvectors of covariance matrix = principal component directions
  • Eigenvalues = variance explained by each component
  • Choose k by cumulative explained variance (typically 95%) or elbow in scree plot
  • PCA preprocessing before classification can hurt (not all variance is class-relevant)
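The last point can be illustrated with a small synthetic sketch (the data and the helper name class_gap are made up). Here the two classes differ only along a low-variance feature, so the first principal component, which follows the high-variance noise feature, carries almost no class separation.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
labels = np.repeat([0, 1], n)                      # 200 samples per class

# Feature 1: large variance, unrelated to the class
x_noise = rng.normal(0.0, 10.0, size=2 * n)
# Feature 2: small variance, but its mean depends on the class
x_signal = rng.normal(0.0, 0.3, size=2 * n) + np.where(labels == 1, 1.0, -1.0)

X = np.column_stack([x_noise, x_signal])
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))   # ascending order
pc1 = eigvecs[:, -1]       # highest-variance direction
pc2 = eigvecs[:, 0]        # lowest-variance direction

def class_gap(z):
    """Distance between class means, in units of the overall std dev."""
    return abs(z[labels == 1].mean() - z[labels == 0].mean()) / z.std(ddof=1)

gap_pc1 = class_gap(Xc @ pc1)
gap_pc2 = class_gap(Xc @ pc2)
print(gap_pc1, gap_pc2)
```

Keeping only the first component (k = 1) would discard the direction that separates the classes, even though it explains nearly all of the variance.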