Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

2.3 Advanced Sampling Theory

Lesson 11 of 32 in the free Data Visualisation and Analytics notes on Siksha Sarovar, written by Rohit Jangra.

Sampling Theory: Bridging Sample and Population

1. The Core Objective

In data analytics, we rarely have access to the entire Population (N). Instead, we take a Sample (n) to estimate population properties.

  • Parameter: A fixed but unknown value of the population (e.g., Population Mean μ, Variance σ²).
  • Statistic: A known value calculated from the sample (e.g., Sample Mean x̄, Sample Variance s²), used as an estimator for the parameter.

Study Deep: Degrees of Freedom (df)

Degrees of freedom are the number of independent pieces of information that go into the calculation of a statistic.

  • Intuition: If you know the sum of 5 numbers, you can freely pick any 4 of them, but the 5th is then fixed by the sum.
  • Application: When calculating Sample Variance, we divide by $(n - 1)$ instead of $n$ to obtain an unbiased estimate of the population variance. This is called Bessel's Correction.
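
The difference between a parameter and a statistic, and the n - 1 divisor used for the sample variance, can be illustrated with a small sketch. The population here is synthetic (10,000 made-up exam scores), and the sample size of 50 is an arbitrary choice for illustration; in real work the population values would not be available.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: N = 10,000 synthetic exam scores (an assumption).
population = [random.gauss(60, 12) for _ in range(10_000)]

# Parameters: computed from the whole population (normally unknown in practice).
mu = statistics.fmean(population)
sigma_sq = statistics.pvariance(population)  # divides by N

# Statistics: computed from a sample of n = 50 drawn without replacement.
sample = random.sample(population, 50)
x_bar = statistics.fmean(sample)
s_sq = statistics.variance(sample)           # divides by n - 1

print(f"mu      = {mu:.2f}   x_bar = {x_bar:.2f}")
print(f"sigma^2 = {sigma_sq:.2f}   s^2   = {s_sq:.2f}")
```

The sample values x̄ and s² will differ somewhat from μ and σ² on every run with a different seed; that sampling variation is what the rest of this lesson quantifies.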

2. Standard Error vs. Standard Deviation

  • Standard Deviation (σ): Measures the spread of individual data points around the mean.
  • Standard Error (SE): Measures the spread of sample means around the true population mean. It represents our uncertainty in the estimate.
  • Formula: SE = σ / √n
  • Insight: To halve your error margin, you must quadruple (4x) your sample size!
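
As a quick check of the formula and the "quadruple the sample" insight, here is a minimal sketch. The function name `standard_error` and the value σ = 12 are assumptions chosen for illustration.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


sigma = 12.0  # assumed population standard deviation

se_100 = standard_error(sigma, 100)  # 12 / 10
se_400 = standard_error(sigma, 400)  # 12 / 20

print(se_100, se_400)
print("ratio:", se_100 / se_400)  # quadrupling n halves the standard error
```

In practice σ is usually unknown, so the sample standard deviation s is substituted for it when estimating SE.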

3. The Central Limit Theorem (CLT) - Mathematical Intuition

The CLT is one of the most important results in statistics. Definition: As the sample size n increases, the distribution of the sample mean approaches a Normal Distribution, regardless of the original population's shape (even if highly skewed or uniform), provided the observations are independent and the population has a finite variance.

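  • Rule of Thumb: n ≥ 30 is often treated as sufficient for the CLT approximation, although heavily skewed or heavy-tailed populations may need larger samples.
  • Implication for BCA: Because of the CLT, Normal-based methods and Z-tests on means are reasonable for large datasets (like user logs or sensor data), even if the raw data is non-normal. Note that the CLT describes the distribution of sample means, not of the raw data itself.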
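
The CLT can be observed with a small simulation. This sketch assumes an exponential population with rate 1 (mean 1, standard deviation 1, strongly right-skewed) and draws 5,000 samples of size n = 30; both numbers are illustrative choices, not values from the lesson.

```python
import math
import random
import statistics

random.seed(42)

n = 30          # size of each sample
trials = 5_000  # number of samples drawn

# Skewed population (assumed): exponential with rate 1, so mean = 1 and sd = 1.
sample_means = []
for _ in range(trials):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = 1.0 / math.sqrt(n)  # sigma / sqrt(n) with sigma = 1

# For a Normal distribution, about 95% of values lie within 1.96 SE of the mean.
within = sum(abs(m - 1.0) <= 1.96 * theoretical_se for m in sample_means) / trials

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"sd of sample means:   {sd_of_means:.3f} (theory: {theoretical_se:.3f})")
print(f"share within 1.96 SE: {within:.3f}")
```

Even though individual exponential values are far from Normal, the sample means cluster around 1 with a spread close to σ/√n, and roughly 95% fall within 1.96 standard errors.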

4. Degrees of Freedom (df)

Degrees of freedom represent the number of independent values that can vary in an analysis without breaking any constraints.

  • Example: If you know the mean of 5 numbers is 10, the first 4 numbers can be anything. But the 5th number is strictly determined to make the mean 10. Thus, df = 5 - 1 = 4.
  • Dividing by n-1 (Bessel's correction) instead of n when calculating sample variance (s²) provides an unbiased estimate of the population variance.
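
Both points can be checked numerically. The sketch below first works through the "mean of 5 numbers is 10" constraint, then runs a simulation comparing the two divisors. The four free values, the Normal population with σ = 2, and the trial count are all assumptions chosen for illustration.

```python
import random

# Degrees of freedom: the mean of 5 numbers is fixed at 10, so their sum is 50.
free_values = [7, 12, 9, 15]       # any four values may be chosen freely
fifth = 5 * 10 - sum(free_values)  # the fifth value is forced by the constraint
print("forced fifth value:", fifth)

# Bessel's correction: average many variance estimates from small samples.
random.seed(1)
true_var = 4.0          # assumed population variance (sigma = 2)
n, trials = 5, 20_000
biased_sum = 0.0
unbiased_sum = 0.0
for _ in range(trials):
    sample = [random.gauss(0, 2) for _ in range(n)]
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)  # sum of squared deviations
    biased_sum += ss / n          # divide by n
    unbiased_sum += ss / (n - 1)  # divide by n - 1 (Bessel's correction)

avg_biased = biased_sum / trials
avg_unbiased = unbiased_sum / trials
print(f"average with divisor n:     {avg_biased:.2f}")
print(f"average with divisor n - 1: {avg_unbiased:.2f} (true variance {true_var})")
```

With divisor n, the long-run average settles near (n - 1)/n times the true variance, which is 0.8 × 4 = 3.2 here, while the n - 1 divisor averages close to 4.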

5. Derived Sampling Distributions

When we compute statistics from normal populations, they follow specific distributions:

  1. Student's t-Distribution: Used when estimating the mean and σ is unknown. It has heavier tails than the Normal distribution to account for the extra uncertainty of estimating σ with s. As df increases, t approaches Normal.
  2. Chi-Square (χ²) Distribution: The distribution of the sum of squared independent standard normal variables. Used for testing variances and categorical independence.
  3. F-Distribution: The ratio of two independent Chi-Square variables, each divided by its degrees of freedom. Used in ANOVA, which compares group means by testing the ratio of between-group variance to within-group variance.
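
To make the three definitions concrete, each distribution can be built from standard normal draws. This is a simulation sketch, not a replacement for statistical tables; the helper names (`chi_square`, `student_t`, `f_ratio`) and the chosen degrees of freedom are hypothetical.

```python
import math
import random
import statistics

random.seed(7)


def chi_square(k: int) -> float:
    """Sum of k squared independent standard normal draws."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(k))


def student_t(k: int) -> float:
    """Standard normal divided by sqrt(chi-square / df)."""
    return random.gauss(0, 1) / math.sqrt(chi_square(k) / k)


def f_ratio(d1: int, d2: int) -> float:
    """Ratio of two independent chi-squares, each divided by its df."""
    return (chi_square(d1) / d1) / (chi_square(d2) / d2)


trials = 20_000
chi_mean = statistics.fmean(chi_square(5) for _ in range(trials))
t_tail = sum(abs(student_t(5)) > 1.96 for _ in range(trials)) / trials
f_mean = statistics.fmean(f_ratio(5, 20) for _ in range(trials))

print(f"mean of chi-square(5): {chi_mean:.2f} (theory: 5)")
print(f"P(|t_5| > 1.96):       {t_tail:.3f} (Normal would give about 0.05)")
print(f"mean of F(5, 20):      {f_mean:.2f} (theory: 20/18)")
```

The t tail share comes out well above the Normal's 5%, which is the "heavier tails" property described in item 1.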