Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

2.9 Correlation & Linear Regression

Lesson 17 of 32 in the free Data Visualisation and Analytics notes on Siksha Sarovar, written by Rohit Jangra.

Correlation & Regression: Modeling Relationships

1. Pearson Correlation Coefficient (r)

Quantifies the linear relationship between two continuous variables.

Study Deep: Adjusted R-Squared

Standard R-Squared has a major flaw: for an OLS model fitted with an intercept, it never decreases when you add a new variable, even if that variable is useless "noise."

  • The Problem: This encourages Overfitting, because piling on features always looks like an improvement.
  • The Solution: Adjusted R-Squared. It penalizes the score based on the number of features (p) relative to the number of data points (n): Adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1). If you add a variable that contributes little, Adjusted R² can actually go down, telling you the extra complexity is not paying for itself.
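The penalty can be seen with a small calculation. The sketch below assumes the usual definition Adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of data points and p the number of features. The function name `adjusted_r2` and the R² values are hypothetical, chosen only to illustrate a tiny R² gain from one extra feature.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p features (intercept not counted)."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than features plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical numbers: adding a second feature nudges R^2 from 0.820 to 0.821.
n = 30
before = adjusted_r2(0.820, n, p=1)
after = adjusted_r2(0.821, n, p=2)

print(round(before, 4), round(after, 4))  # 0.8136 0.8077
```

In this illustration R² rises slightly while Adjusted R² falls, which is the signal that the added feature is probably noise.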

1. Pearson Correlation Coefficient (r)

  • Formula: r = Σ((x - x̄)(y - ȳ)) / √( Σ(x - x̄)² * Σ(y - ȳ)² )
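  • Covariance: The numerator is the sum of cross-products, which is proportional to the covariance (how the two variables vary together). The denominator is proportional to the product of their standard deviations, which standardizes r so it is always bounded between -1.0 and +1.0.
  • Limitations: Only captures linear relationships. It can be close to 0 for a strong curved relationship (e.g., a symmetric parabola) and is highly sensitive to outliers.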
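The formula and both limitations can be checked directly. This is a minimal sketch using only the standard library; the helper name `pearson_r` and the small datasets are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: sum of cross-products over the root of the squared sums."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
line = [2 * x + 1 for x in xs]       # perfectly linear
parabola = [x * x for x in xs]       # perfectly determined by x, but curved
with_outlier = line[:-1] + [-20]     # same line with one extreme last point

r_line = pearson_r(xs, line)             # 1.0
r_parabola = pearson_r(xs, parabola)     # 0.0
r_outlier = pearson_r(xs, with_outlier)  # negative: one point flips the sign

print(r_line, r_parabola, round(r_outlier, 3))
```

The parabola is fully determined by x yet gives r = 0, and a single outlier turns a perfect positive correlation negative.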

2. Simple Linear Regression (OLS)

While correlation measures the strength of a relationship, regression fits a mathematical model that can be used for prediction. Equation: Y = β₀ + β₁X + ε

  • Y: Dependent variable (Target).
  • X: Independent variable (Feature).
  • β₀ (Intercept): Expected value of Y when X is 0.
  • β₁ (Slope): How much Y changes, on average, for a 1-unit increase in X.
  • ε (Error): The noise, i.e. the part of Y not explained by X. Its observed counterpart in a fitted model is the residual.

3. Ordinary Least Squares (OLS) Derivation

Regression calculates the line of best fit by minimizing the Sum of Squared Errors (SSE).

  • Error (Residual) = Y_actual - Y_predicted
  • Slope Formula: β₁ = r * (S_y / S_x), where S_y and S_x are the sample standard deviations of Y and X (correlation rescaled into the units of Y per unit of X).
  • Intercept Formula: β₀ = ȳ - β₁x̄ (The fitted line always passes through the point (x̄, ȳ)).
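The two formulas above can be applied by hand or in a few lines of code. The sketch below fits a line with the least-squares slope Σ((x - x̄)(y - ȳ)) / Σ(x - x̄)² and then confirms it equals r * (S_y / S_x). The function name `fit_simple_ols` and the hours/scores data are hypothetical.

```python
import math
import statistics

def fit_simple_ols(xs, ys):
    """Return (beta0, beta1) for the line minimizing the sum of squared errors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("X is constant, so the slope is undefined")
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar  # forces the line through (x_bar, y_bar)
    return beta0, beta1

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]
beta0, beta1 = fit_simple_ols(hours, scores)

# Same slope obtained from the correlation coefficient: r * (S_y / S_x).
x_bar, y_bar = statistics.fmean(hours), statistics.fmean(scores)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
sxx = sum((x - x_bar) ** 2 for x in hours)
syy = sum((y - y_bar) ** 2 for y in scores)
r = sxy / math.sqrt(sxx * syy)
slope_from_r = r * statistics.stdev(scores) / statistics.stdev(hours)

print(round(beta0, 2), round(beta1, 2), round(slope_from_r, 2))  # 47.7 4.1 4.1
```

For this data the fitted line is roughly Y = 47.7 + 4.1X: each extra hour is associated with about 4.1 more points.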

4. Evaluating the Model: R-Squared (R²)

R² (Coefficient of Determination) measures how much of the variation in Y the fitted model explains. Variance Decomposition:

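  • SST (Total Sum of Squares): Total variation in Y around its mean, Σ(y - ȳ)².
  • SSR (Regression Sum of Squares): Variation explained by the line, Σ(ŷ - ȳ)².
  • SSE (Error Sum of Squares): Unexplained variation, Σ(y - ŷ)².
  • SST = SSR + SSE (holds for an OLS fit that includes an intercept).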
  • Formula: R² = SSR / SST = 1 - (SSE / SST)
  • Interpretation: If R² = 0.82, it means 82% of the variation in Y is explained by the linear relationship with X; the remaining 18% is unexplained.
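The decomposition can be verified numerically. This sketch fits a least-squares line and computes the three sums of squares; the function names and the hours/scores data are hypothetical, used only to show that SST = SSR + SSE and that both R² formulas agree.

```python
import statistics

def fit_simple_ols(xs, ys):
    """Return (beta0, beta1) for the least-squares line."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    return y_bar - beta1 * x_bar, beta1

def sums_of_squares(xs, ys):
    """Return (SST, SSR, SSE) for a simple OLS fit of ys on xs."""
    beta0, beta1 = fit_simple_ols(xs, ys)
    y_bar = statistics.fmean(ys)
    preds = [beta0 + beta1 * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)               # total
    ssr = sum((p - y_bar) ** 2 for p in preds)            # explained by the line
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))    # residual
    return sst, ssr, sse

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]
sst, ssr, sse = sums_of_squares(hours, scores)

r2_from_ssr = ssr / sst
r2_from_sse = 1 - sse / sst
print(round(sst, 2), round(ssr, 2), round(sse, 2), round(r2_from_ssr, 4))
# 170.0 168.1 1.9 0.9888
```

Here about 98.9% of the variation in the scores is accounted for by the fitted line.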

5. The Gauss-Markov Assumptions (BLUE)

For the OLS estimator to be the Best Linear Unbiased Estimator (BLUE), these must hold:

  1. Linear in parameters: The model is linear in the coefficients (β₀, β₁).
  2. Random Sampling: Observations are drawn at random from the population of interest.
  3. No perfect multicollinearity: Independent variables aren't perfectly correlated with each other.
  4. Zero conditional mean: The errors average to zero at every value of X (E(ε|X) = 0).
  5. Homoscedasticity: Variance of errors is constant across all X. (Violations cause "fanning" in residual plots).
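A rough numerical stand-in for reading a residual plot is to compare the spread of the residuals at low and high values of X. The sketch below does this on simulated data; the helper `spread_ratio`, the seed, and the noise levels are all assumptions made for illustration, and this is an informal check rather than a formal test for heteroscedasticity.

```python
import random
import statistics

def fit_simple_ols(xs, ys):
    """Return (beta0, beta1) for the least-squares line."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    return y_bar - beta1 * x_bar, beta1

def spread_ratio(xs, ys):
    """Residual std. dev. in the upper half of X divided by that in the lower half."""
    beta0, beta1 = fit_simple_ols(xs, ys)
    pairs = sorted((x, y - (beta0 + beta1 * x)) for x, y in zip(xs, ys))
    residuals = [res for _, res in pairs]
    half = len(residuals) // 2
    return statistics.pstdev(residuals[half:]) / statistics.pstdev(residuals[:half])

rng = random.Random(0)  # fixed seed so the simulation is repeatable
xs = [1 + 9 * i / 1999 for i in range(2000)]  # X evenly spaced from 1 to 10

# Constant error variance (assumption holds) vs. error that grows with X (violated).
constant_noise = [3 + 2 * x + rng.gauss(0, 1.0) for x in xs]
growing_noise = [3 + 2 * x + rng.gauss(0, 0.5 * x) for x in xs]

ratio_constant = spread_ratio(xs, constant_noise)
ratio_growing = spread_ratio(xs, growing_noise)
print(round(ratio_constant, 2), round(ratio_growing, 2))
```

A ratio near 1 is consistent with constant variance, while a ratio well above 1 corresponds to the "fanning" shape in a residual plot.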