Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

2.11 Statistical Paradoxes in Analytics

Lesson 19 of 32 in the free Data Visualisation and Analytics notes on Siksha Sarovar, written by Rohit Jangra.

Statistical Paradoxes: When Math Defies Logic

1. Simpson’s Paradox

A phenomenon in which a trend that holds within each subgroup of the data weakens, disappears, or reverses when the subgroups are combined.

  • Root Cause: A confounding variable (here, case severity) that is distributed very unevenly across the groups being compared.
  • Classic Example: A hospital tests Treatment A and Treatment B.
  • Mild Cases: Treatment A cures 90%, Treatment B cures 80%. (A is better.)
  • Severe Cases: Treatment A cures 20%, Treatment B cures 10%. (A is better.)
  • Aggregated: Treatment B looks better overall! Why? Treatment A was given to 90% of the severe cases, while Treatment B was given mostly to mild cases, so A's overall cure rate is dominated by its hardest patients.
  • Solution: Stratify data by critical confounders (severity, department, user-type) and compare within each stratum before trusting an aggregate.
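The reversal is easier to see with concrete counts. The sketch below keeps the per-stratum cure rates from the example but assumes hypothetical patient numbers (1,000 patients per treatment, with 90% of the severe cases assigned to Treatment A); the `trials` table and the `cure_rate` helper are illustrative names, not part of the original notes.

```python
# Hypothetical patient counts; only the per-stratum cure rates come from the example.
# (treatment, severity) -> (patients treated, patients cured)
trials = {
    ("A", "mild"):   (100, 90),    # 90% cured
    ("A", "severe"): (900, 180),   # 20% cured
    ("B", "mild"):   (900, 720),   # 80% cured
    ("B", "severe"): (100, 10),    # 10% cured
}


def cure_rate(treatment, severity=None):
    """Cure rate for a treatment, optionally restricted to one severity stratum."""
    rows = [
        counts
        for (trt, sev), counts in trials.items()
        if trt == treatment and (severity is None or sev == severity)
    ]
    treated = sum(n for n, _ in rows)
    cured = sum(c for _, c in rows)
    return cured / treated


# Compare within each stratum first, then on the pooled data.
for severity in ("mild", "severe", None):
    label = severity or "aggregated"
    print(f"{label:>10}: A = {cure_rate('A', severity):.0%}, "
          f"B = {cure_rate('B', severity):.0%}")
```

With these assumed counts, A wins in both strata (90% vs 80%, 20% vs 10%) yet loses on the pooled data (27% vs 73%), because 900 of A's 1,000 patients were severe cases.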

2. Base Rate Fallacy (False Positive Paradox)

Ignoring how rare an event is (its base rate) when interpreting a positive test result.

  • Example: A facial recognition security system is advertised as 99% accurate, meaning here that it has a 1% false positive rate. 10,000 employees pass through. Only 1 is a known threat (the base rate is 1 in 10,000).
  • Assume the system flags the threat (True Positive).
  • It also flags 1% of the 9,999 normal employees ≈ 100 people (False Positives).
  • Paradox: If the alarm sounds, what is the probability the person is actually a threat? About 1 / 101 ≈ 1%. Despite a "99% accurate" system, an alarm is wrong about 99% of the time!
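The same arithmetic can be written as Bayes' theorem. This is a minimal sketch: the function name `positive_predictive_value` is illustrative, and the sensitivity of 1.0 is an assumption taken from the example's premise that the one real threat is always flagged.

```python
def positive_predictive_value(base_rate, sensitivity, false_positive_rate):
    """P(threat | alarm) via Bayes' theorem."""
    true_alarms = base_rate * sensitivity                  # P(alarm and threat)
    false_alarms = (1 - base_rate) * false_positive_rate   # P(alarm and no threat)
    return true_alarms / (true_alarms + false_alarms)


employees = 10_000
threats = 1

ppv = positive_predictive_value(
    base_rate=threats / employees,
    sensitivity=1.0,             # assumption: the real threat is always flagged
    false_positive_rate=0.01,    # 1% of normal employees trigger the alarm
)
print(f"P(threat | alarm) = {ppv:.4f}")
```

The result is just under 1%. Raising the base rate or lowering the false positive rate are the only ways to make an alarm more trustworthy; the headline "accuracy" alone says little.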

3. P-Hacking (Data Dredging)

The unethical practice of trying many different variables, transformations, or subsets of data until a statistically significant result (p < 0.05) turns up by sheer chance, and then reporting only that result.

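  • If you test 20 random, unrelated hypotheses at α=0.05, you expect 20 × 0.05 = 1 of them to be significant purely by luck, and (for independent tests) the probability of at least one false positive is 1 − 0.95²⁰ ≈ 64%.
  • Solution: Strict train/validation/test splits in ML, preregistering hypotheses, and using multiple-comparison corrections such as Bonferroni or False Discovery Rate (FDR) control.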
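The numbers behind the 20-hypothesis example, and one common FDR correction, can be sketched in a few lines. The family-wise error calculation assumes the tests are independent; the p-values are hypothetical, and `benjamini_hochberg` is a plain-Python illustration of the Benjamini-Hochberg step-up procedure rather than a library function.

```python
alpha = 0.05
n_tests = 20

# Expected number of false positives when every null hypothesis is true.
expected_false_positives = n_tests * alpha

# Probability of at least one false positive, assuming independent tests.
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one): {p_at_least_one:.2f}")


def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of hypotheses rejected by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * fdr.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    return sorted(order[:cutoff])


# Hypothetical p-values from 10 tests run on the same dataset.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive = [i for i, p in enumerate(p_values) if p < alpha]
corrected = benjamini_hochberg(p_values, fdr=alpha)

print("significant without correction:", naive)
print("significant after BH correction:", corrected)
```

On these made-up p-values, five results pass the naive p < 0.05 cut, but only the two smallest survive the correction.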

4. Overfitting & Bias-Variance Tradeoff

In ordinary least squares regression, adding more variables (X) never decreases R² on the training data, and in practice almost always increases it, even if the variables are garbage.

  • Overfitting: Creating a model so complex that it fits the training data almost perfectly (memorizing noise) but performs poorly on unseen data.
  • Solution: Use Adjusted R² (which penalizes every added variable, so useless ones lower the score) or cross-validation on held-out data.
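The penalty in Adjusted R² follows from its standard formula, 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. The sketch below applies it to two hypothetical models; the R² values, sample size, and predictor counts are made up for illustration.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_predictors + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof


n = 30  # hypothetical number of observations

# Hypothetical model with 3 useful predictors.
baseline = adjusted_r2(0.800, n, 3)

# Same model padded with 5 noise predictors: R^2 creeps up slightly.
padded = adjusted_r2(0.802, n, 8)

print(f"baseline: R2 = 0.800, adjusted = {baseline:.3f}")
print(f"padded:   R2 = 0.802, adjusted = {padded:.3f}")
```

Plain R² rises from 0.800 to 0.802, while the adjusted value falls from about 0.777 to about 0.727, which flags the extra variables as not worth their cost.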

5. Survivorship Bias

Drawing conclusions based only on subjects that "survived" a process, ignoring those that didn't.

  • Example: WWII planes returned with bullet holes in the wings, so the military wanted to armor the wings. Statistician Abraham Wald realized that the planes shot in the engines didn't survive to return. The armor needed to go where the bullet holes weren't.
  • Data Analytics: Analyzing only the attributes of "current successful startups" while ignoring the far larger number of startups that failed with those exact same attributes.
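A small count table shows why looking only at survivors misleads. The cohort below is entirely hypothetical: 1,000 startups, 800 of which share some attribute, with an identical 10% survival rate whether or not they have it.

```python
# Hypothetical cohort: (has_attribute, survived) -> number of startups
cohort = {
    (True, True): 80,
    (True, False): 720,
    (False, True): 20,
    (False, False): 180,
}

# Survivor-only view: how common is the attribute among successful startups?
survivors = cohort[(True, True)] + cohort[(False, True)]
share_among_survivors = cohort[(True, True)] / survivors


def survival_rate(has_attribute):
    """Survival rate computed over the full cohort, failures included."""
    survived = cohort[(has_attribute, True)]
    failed = cohort[(has_attribute, False)]
    return survived / (survived + failed)


print(f"share of survivors with the attribute: {share_among_survivors:.0%}")
print(f"survival rate with attribute:    {survival_rate(True):.0%}")
print(f"survival rate without attribute: {survival_rate(False):.0%}")
```

In this made-up cohort, 80% of the successful startups have the attribute, yet it makes no difference to the odds of survival (10% either way). The attribute is simply common, which only becomes visible once the failures are counted.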