Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.


Practical 12: Correlation Between Random Numbers

Lesson 12 of 15 in the free Data Visualisation and Analytics Lab notes on Siksha Sarovar, written by Rohit Jangra.

Aim

To generate two related random number series with NumPy (Y built as roughly 0.7 × X plus noise), place them in a DataFrame, and measure their linear relationship with the Pearson correlation coefficient via df["X"].corr(df["Y"]).

CO Mapping: CO1, CO2, CO4, CO5

Theory

The Pearson correlation coefficient r measures the strength and direction of the linear relationship between two variables. It is the covariance of X and Y divided by the product of their standard deviations:

r = cov(X, Y) / (σ_X · σ_Y)

That normalisation forces r into the fixed range −1 to +1, making it unit-free and comparable across datasets (unlike raw covariance — see Practical 13). Reading the value: r = +1 means all points lie exactly on an upward-sloping line, r = −1 on a downward one, r = 0 means no linear association. Rules of thumb: |r| above 0.7 is usually called strong, 0.3–0.7 moderate, below 0.3 weak.
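As a rough check of the formula and the unit-free claim, the sketch below computes r by hand on five made-up pairs (illustrative values, not this practical's dataset), compares it with NumPy's built-in coefficient, and then rescales X by 100 to show that covariance changes while r does not. The variable names are assumptions chosen for this example.

```python
import numpy as np

# Illustrative data only: five hand-picked pairs, not the practical's dataset.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Sample covariance and sample standard deviations. Using ddof=1 in both
# places means the (n - 1) factors cancel in the ratio.
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# NumPy's built-in Pearson coefficient, for comparison.
r_numpy = np.corrcoef(x, y)[0, 1]

# Rescale X (say, metres to centimetres): covariance grows 100x, r is unchanged.
x_scaled = x * 100
cov_scaled = np.cov(x_scaled, y, ddof=1)[0, 1]
r_scaled = np.corrcoef(x_scaled, y)[0, 1]

print("r by formula :", round(r_manual, 4))
print("r by NumPy   :", round(r_numpy, 4))
print("r after scale:", round(r_scaled, 4))
print("cov before/after scaling:", cov_xy, cov_scaled)
```

For these five pairs r works out to about 0.77, which the rule of thumb above would call strong.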

This practical constructs correlation deliberately. x is 30 random integers between 10 and 99; y is defined as 0.7 * x plus uniform noise between −10 and +9. Y therefore contains a genuine linear signal (slope 0.7) partially obscured by noise — so r should come out strongly positive but not exactly 1. Because np.random.seed(42) fixes the generator, every run (and every student) gets identical numbers, a reproducibility discipline carried over from Practical 6.
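The reproducibility point can be checked directly. The sketch below wraps the two draws in a hypothetical helper, draw_series (a name invented for this example), and confirms that reseeding with 42 reproduces the same arrays, while a different, arbitrarily chosen seed does not.

```python
import numpy as np

def draw_series(seed):
    """Seed NumPy's legacy global generator, then draw x and the noise term."""
    np.random.seed(seed)
    x = np.random.randint(10, 100, 30)      # 30 integers in [10, 99]
    noise = np.random.randint(-10, 10, 30)  # 30 integers in [-10, 9]
    return x, noise

x1, n1 = draw_series(42)
x2, n2 = draw_series(42)
x3, _ = draw_series(7)  # 7 is an arbitrary alternative seed

same_seed_matches = np.array_equal(x1, x2) and np.array_equal(n1, n2)
different_seed_matches = np.array_equal(x1, x3)

print("Same seed gives identical data:     ", same_seed_matches)
print("Different seed gives identical data:", different_seed_matches)
```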

Two standing warnings. First, correlation is not causation: here we happen to know Y is computed from X because we wrote the code, but with real-world data an r of 0.96 says nothing about direction or mechanism (a lurking third variable may drive both). Second, r only sees straight lines: a perfect parabola sampled symmetrically about its vertex scores r ≈ 0, so always scatter-plot before trusting the number.
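The straight-line limitation is easy to demonstrate. In the sketch below (illustrative data, not part of the practical), y = x² is a perfectly deterministic function of x, yet r is essentially zero when x is symmetric about the vertex. Sampling only one arm of the same parabola gives a high r, which is one more reason to look at the scatter plot.

```python
import numpy as np

# Symmetric sample around the vertex at x = 0.
x_sym = np.arange(-10, 11)
y_sym = x_sym ** 2
r_sym = np.corrcoef(x_sym, y_sym)[0, 1]

# Only the right-hand arm of the same parabola.
x_half = np.arange(0, 21)
y_half = x_half ** 2
r_half = np.corrcoef(x_half, y_half)[0, 1]

print("Symmetric parabola, r =", round(r_sym, 4))
print("One arm only,       r =", round(r_half, 4))
```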

Dataset

Generated in the snippet with np.random.seed(42) — 30 pairs; the first five rows printed by df.head() are:

Index   X    Y
0       61   47
1       24   20
2       81   60
3       70   57
4       30   22

Procedure

  1. Seed NumPy with np.random.seed(42) so the "random" data is reproducible.
  2. Generate x = np.random.randint(10, 100, 30) — 30 integers in [10, 99].
  3. Build y = (0.7 * x + np.random.randint(-10, 10, 30)).astype(int), a linear function of x with additive integer noise in [−10, 9]; astype(int) then drops the fractional part.
  4. Assemble both arrays into the DataFrame df with columns X and Y and print df.head().
  5. Compute corr_value = df["X"].corr(df["Y"]) — pandas' .corr() defaults to the Pearson method.
  6. Print the coefficient rounded to 4 decimal places.
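Put together, the six steps above can be sketched as the short program below. The expressions come straight from the procedure; the wording of the final print label is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Step 1: fix the legacy global generator so the data is reproducible.
np.random.seed(42)

# Step 2: 30 integers in [10, 99] (the upper bound 100 is exclusive).
x = np.random.randint(10, 100, 30)

# Step 3: linear signal 0.7 * x plus integer noise in [-10, 9];
# astype(int) truncates the fractional part toward zero.
y = (0.7 * x + np.random.randint(-10, 10, 30)).astype(int)

# Step 4: assemble the DataFrame and show the first five rows.
df = pd.DataFrame({"X": x, "Y": y})
print(df.head())

# Step 5: Series.corr uses method="pearson" unless told otherwise.
corr_value = df["X"].corr(df["Y"])

# Step 6: report the coefficient to 4 decimal places.
print("Correlation between X and Y:", round(corr_value, 4))
```

Passing method="spearman" or method="kendall" to .corr() would switch to a rank-based coefficient, but this practical uses the default Pearson method.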

Interpretation of Results

With seed 42 the program prints r = 0.9589, a very strong positive linear correlation and exactly what the construction predicts. Sanity-check it against the printed head: when X is large (81) Y is large (60); when X is small (24) Y is small (20), so the pairs rise and fall together. Why not r = 1.0? The noise term (integers in [−10, 9]) scatters each point by up to about 10 units around the perfect line 0.7x; relative to X's wide spread (values 11–98, so the signal component spans roughly 8–69), that noise is small, which is why r stays near 1 rather than degrading toward 0.7 or lower. The lesson in reverse: had the noise range been ±50 with the same slope, r would fall sharply, because correlation strength reflects the signal-to-noise ratio, not the slope itself. The slope 0.7 sets the direction (positive); the tight noise sets the magnitude (≈ 0.96).
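The signal-to-noise argument can be tried out numerically. The sketch below is an illustration under assumptions, not the practical's program: it uses a separate NumPy Generator with an arbitrary seed, a larger sample (5000 points) so the comparison is stable, and a hypothetical helper r_for_noise. The slope stays at 0.7 and only the noise width changes.

```python
import numpy as np

# Assumptions for this sketch: arbitrary seed 0 and a large sample of 5000.
rng = np.random.default_rng(0)
x = rng.integers(10, 100, 5000)  # integers in [10, 99]

def r_for_noise(half_width):
    """Pearson r between x and 0.7 * x + uniform noise in [-half_width, half_width)."""
    noise = rng.uniform(-half_width, half_width, x.size)
    y = 0.7 * x + noise
    return np.corrcoef(x, y)[0, 1]

r_tight = r_for_noise(10)  # noise comparable to the practical's
r_wide = r_for_noise(50)   # five times wider, same slope

print("Noise about +/-10: r =", round(r_tight, 3))
print("Noise about +/-50: r =", round(r_wide, 3))
```

With the slope unchanged, the wider noise alone pulls r down from above 0.9 to roughly the middle of the moderate band.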

Common Mistakes

  1. Reading the 0.7 in the formula as the expected correlation — 0.7 is the slope; r depends on how big the noise is relative to the signal, and here lands at 0.9589.
  2. Omitting the seed and then puzzling over why the printed coefficient differs between runs and from classmates' outputs.
  3. Declaring "X causes Y" from a high r on real data — correlation is symmetric and silent about causation; only here do we know the mechanism because we coded it.
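The symmetry mentioned in the third mistake can be checked on the practical's own data: swapping the roles of X and Y leaves r unchanged, so the coefficient by itself cannot say which variable drives the other. The sketch regenerates the data as described in the procedure; the names r_xy and r_yx are chosen for this example.

```python
import numpy as np
import pandas as pd

# Rebuild the practical's dataset.
np.random.seed(42)
x = np.random.randint(10, 100, 30)
y = (0.7 * x + np.random.randint(-10, 10, 30)).astype(int)
df = pd.DataFrame({"X": x, "Y": y})

# Pearson r in both directions.
r_xy = df["X"].corr(df["Y"])
r_yx = df["Y"].corr(df["X"])

print("corr(X, Y) =", round(r_xy, 4))
print("corr(Y, X) =", round(r_yx, 4))
```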

🎯 Viva Questions

  1. What range can the Pearson r take? −1 to +1 inclusive; the sign gives direction, the magnitude gives strength of the linear relationship.
  2. How is r related to covariance? r = cov(X, Y) / (σ_X σ_Y) — covariance normalised by both standard deviations.
  3. Why is r here 0.9589 and not exactly 1? The uniform noise in [−10, 9] scatters points around the line y = 0.7x, weakening the fit slightly.
  4. What would r ≈ 0 mean? No linear association — though a strong non-linear relationship (e.g. parabolic) could still exist.
  5. What does np.random.seed(42) achieve? Deterministic pseudo-random output, so results are reproducible across runs and machines.
  6. Does r = 0.96 prove X causes Y? No — correlation never establishes causation; here we only know the mechanism because y was explicitly computed from x.