Aim
To locate missing values (NaN) in a student records DataFrame and replace them with zero using fillna(0), while learning when zero-imputation is legitimate and when it silently corrupts an analysis.
CO Mapping: CO1, CO2
Theory
Real datasets are rarely complete: survey non-response, sensor dropouts, unmatched joins and data-entry gaps all leave holes. In numeric columns pandas marks a hole with NaN ("Not a Number", the IEEE-754 floating-point sentinel), which is why an integer column with the default NumPy-backed dtype becomes float64 the moment one value goes missing.
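The dtype change can be checked directly. The sketch below uses two small, made-up Series (the values are illustrative, not part of the lab dataset) and assumes the default NumPy-backed dtypes.

```python
import numpy as np
import pandas as pd

# Complete integer data keeps an integer dtype.
complete = pd.Series([85, 74, 91])

# A single missing value forces the column to float64,
# because NaN is itself a floating-point value.
with_gap = pd.Series([85, np.nan, 91])

print(complete.dtype)            # an integer dtype
print(with_gap.dtype)            # float64
print(with_gap.isna().tolist())  # marks which positions are missing
```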
How data went missing decides the safe treatment:
- MCAR — missing completely at random: dropping or simple imputation stays unbiased;
- MAR — missingness explained by other observed columns: model-based imputation is better;
- MNAR — missing because of the unobserved value itself (weak students skipping tests): every naive fill biases conclusions.
The imputation toolbox: dropna() (delete), fillna(constant), statistical fills (mean/median/mode), and order-aware fills for time series (ffill, bfill, interpolate()). fillna(0) is only correct when absence truly means zero — items not sold, aid not sent. For measurements such as marks or attendance, zero is the most extreme possible value, so zero-filling invents failing students. Also note that fillna returns a new DataFrame; the original is untouched unless you reassign or pass inplace=True.
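As a rough side-by-side of the toolbox, the sketch below applies each option to one hypothetical ordered series (the readings are invented for illustration). It is a comparison of behaviour, not a recommendation of any one fill.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with two gaps.
s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

dropped = s.dropna()              # delete the missing entries
zero_filled = s.fillna(0)         # constant fill
mean_filled = s.fillna(s.mean())  # statistical fill; mean of observed values is 30.0
forward = s.ffill()               # carry the last observation forward
backward = s.bfill()              # pull the next observation backward
interpolated = s.interpolate()    # linear interpolation between neighbours

comparison = pd.DataFrame({
    "original": s,
    "fillna(0)": zero_filled,
    "mean": mean_filled,
    "ffill": forward,
    "bfill": backward,
    "interpolate": interpolated,
})
print(comparison)

# Every call above returned a new object; s still holds its two NaNs.
print(s.isna().sum())
```

Only the order-aware fills (ffill, bfill, interpolate) use the position of each gap, which is why they suit time series and not unordered records.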
Dataset
| Name | Marks | Attendance |
|---|---|---|
| A | 85 | 92 |
| B | NaN | 88 |
| C | 74 | NaN |
| D | NaN | 79 |
Two of four Marks values and one of four Attendance values are missing.
Procedure
- Import pandas and NumPy; create the frame `df` using `np.nan` for the gaps.
- Print `df`. pandas renders the holes as `NaN`; optionally count them with `df.isnull().sum()`.
- Apply `filled_df = df.fillna(0)` and print the result; confirm that `df` itself still contains NaN (fillna returned a copy).
- Compare summaries: `df["Marks"].mean()` (NaN-aware) versus `filled_df["Marks"].mean()`.
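The procedure can be sketched as one short script. The frame is built from the Dataset table; the variable names `mean_before` and `mean_after` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Step 1: build the frame, using np.nan for the gaps.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Marks": [85, np.nan, 74, np.nan],
    "Attendance": [92, 88, np.nan, 79],
})

# Step 2: inspect the holes and count them per column.
print(df)
print(df.isnull().sum())

# Step 3: zero-fill into a new frame.
filled_df = df.fillna(0)
print(filled_df)

# The original is untouched: it still holds all three NaNs.
print(df.isnull().sum().sum())

# Step 4: compare the two summaries.
mean_before = df["Marks"].mean()         # NaN-aware: skips missing values
mean_after = filled_df["Marks"].mean()   # zeros now count as real scores
print(mean_before, mean_after)
```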
Interpretation of Results
Before filling, mean() skips NaN (skipna defaults to True): (85 + 74) / 2 = 79.5, an estimate based on the students actually assessed. After fillna(0): (85 + 0 + 74 + 0) / 4 = 39.75 — the class average halves because two absences were recast as zero scores. Neither number is "wrong" in itself; they answer different questions ("average of those who appeared" vs "average counting absentees as zero"). The analytical duty is to choose deliberately and disclose: any chart or report built on filled data must state the imputation rule, because downstream readers cannot see it in the output.
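One possible way to keep the imputation rule visible is to store a flag next to the filled values, so both questions can still be answered from the same table. This is a sketch only; the column name `Marks_was_missing` is hypothetical and not part of the lab procedure.

```python
import numpy as np
import pandas as pd

marks = pd.Series(
    [85, np.nan, 74, np.nan],
    index=["A", "B", "C", "D"],
    name="Marks",
)

report = pd.DataFrame({
    "Marks": marks.fillna(0),
    # Hypothetical flag column: records which values were imputed.
    "Marks_was_missing": marks.isna(),
})
print(report)

# "Average counting absentees as zero"
mean_all = report["Marks"].mean()

# "Average of those who appeared"
mean_appeared = report.loc[~report["Marks_was_missing"], "Marks"].mean()

print(mean_all, mean_appeared)
```

Without the flag, a filled zero is indistinguishable from a genuinely scored zero.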
Common Mistakes
- Calling `df.fillna(0)` without assigning the result (or using `inplace=True`) and wondering why nothing changed.
- Zero-filling measurement columns and then reporting means or correlations: the means are biased systematically downward and the correlations are distorted.
- Skipping the profiling step: `df.isnull().sum()` per column should precede any imputation decision.
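The first and last mistakes are easy to reproduce on the lab frame. In this sketch the `profile` table and its column names are illustrative choices, not a fixed convention.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Marks": [85, np.nan, 74, np.nan],
    "Attendance": [92, 88, np.nan, 79],
})

# Mistake: the filled frame is computed and immediately discarded.
df.fillna(0)
still_missing = int(df["Marks"].isnull().sum())
print(still_missing)  # df is unchanged

# Profiling first: count and share of missing values per column.
profile = pd.DataFrame({
    "missing": df.isnull().sum(),
    "share": df.isnull().mean(),
})
print(profile)
```

A column that is half missing, as Marks is here, calls for a different decision than one with a single gap.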
🎯 Viva Questions
- What is NaN technically? An IEEE-754 floating-point sentinel; its presence upgrades integer columns to float.
- dropna vs fillna? Delete incomplete rows/columns versus replace the holes with values.
- Why did the mean fall after filling? Fabricated zeros entered the numerator while the denominator grew from 2 to 4.
- When is fillna(0) correct? When absence semantically equals zero — counts like items sold or donated.
- How do you fill different columns differently? Pass a dict: `df.fillna({"Marks": df["Marks"].median(), "Attendance": 0})`.
- What are MCAR, MAR, MNAR? Missing completely at random / at random given observables / not at random (value-dependent), in increasing order of danger for naive fills.
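The dict-based fill from the viva answer can be tried on the lab frame as follows. The names `fill_rules` and `filled` are introduced here for illustration, and the zero for Attendance only demonstrates the syntax; it is not a recommended treatment for a measurement column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Marks": [85, np.nan, 74, np.nan],
    "Attendance": [92, 88, np.nan, 79],
})

# One fill value per column; columns not listed are left alone.
fill_rules = {
    "Marks": df["Marks"].median(),  # median of the observed marks (85 and 74)
    "Attendance": 0,                # syntax demonstration only
}
filled = df.fillna(fill_rules)
print(filled)
```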