Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

Practical 3: Fill Missing Values with Zero

Lesson 3 of 15 in the free Data Visualisation and Analytics Lab notes on Siksha Sarovar, written by Rohit Jangra.

Aim

To locate missing values (NaN) in a student records DataFrame and replace them with zero using fillna(0), while learning when zero-imputation is legitimate and when it silently corrupts an analysis.

CO Mapping: CO1, CO2

Theory

Real datasets are rarely complete: survey non-response, sensor dropouts, unmatched joins and data-entry gaps all leave holes. pandas marks a hole in a numeric column with NaN ("Not a Number", a special IEEE-754 floating-point value). Because NaN is itself a float, a default integer column is converted to dtype float64 the moment one value goes missing; only the nullable Int64 dtype avoids this conversion.
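The dtype change can be observed directly. The following is a minimal sketch; the series names (complete, with_gap, nullable) and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A complete integer column keeps an integer dtype.
complete = pd.Series([85, 74, 90])
print(complete.dtype)

# One missing value forces the default NumPy-backed column to float64,
# because NaN can only be stored as a floating-point value.
with_gap = pd.Series([85, np.nan, 90])
print(with_gap.dtype)

# The nullable "Int64" dtype keeps the integers and marks the gap as <NA>.
nullable = pd.Series([85, None, 90], dtype="Int64")
print(nullable.dtype)
```

The second series is the situation described in the theory: the values 85 and 90 are stored as 85.0 and 90.0 once a gap appears.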

How data went missing decides the safe treatment:

  • MCAR (missing completely at random): dropping incomplete rows stays unbiased; simple imputation such as a mean fill keeps the average unbiased but understates the spread;
  • MAR (missingness explained by other observed columns): model-based imputation is better;
  • MNAR (missing because of the unobserved value itself, for example weak students skipping tests): every naive fill biases conclusions.
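The MNAR case can be illustrated with a small deterministic sketch. The six marks and the cut-off of 50 below are hypothetical, chosen only to show the direction of the bias; they are not part of the lab dataset.

```python
import pandas as pd

# Hypothetical true marks for six students (illustrative values only).
true_marks = pd.Series([35, 42, 58, 66, 74, 85], dtype="float64")

# MNAR assumption: students who would have scored below 50 skip the test,
# so their marks are never observed and become NaN.
observed = true_marks.where(true_marks >= 50)

true_mean = true_marks.mean()                  # what we would like to know
observed_mean = observed.mean()                # NaN-aware, uses 4 students
zero_filled_mean = observed.fillna(0).mean()   # treats absentees as zero

print(true_mean, observed_mean, round(zero_filled_mean, 2))
```

Under these assumptions the true mean is 60.0, the mean of the observed marks is 70.75 (too high) and the zero-filled mean is about 47.17 (too low). Neither naive treatment recovers the true value, which is the sense in which MNAR data defeats simple fills.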

The imputation toolbox: dropna() (delete), fillna(constant), statistical fills (mean/median/mode), and order-aware fills for time series (ffill, bfill, interpolate()). fillna(0) is only correct when absence truly means zero — items not sold, aid not sent. For measurements such as marks or attendance, zero is the most extreme possible value, so zero-filling invents failing students. Also note that fillna returns a new DataFrame; the original is untouched unless you reassign or pass inplace=True.
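Each tool in the list can be tried on one small series. This is a sketch on made-up values (10, NaN, 30, NaN, 50); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

dropped = s.dropna()               # delete the holes: 10, 30, 50
constant = s.fillna(0)             # constant fill: 10, 0, 30, 0, 50
mean_filled = s.fillna(s.mean())   # statistical fill with the mean (30)
forward = s.ffill()                # carry the last seen value forward
backward = s.bfill()               # pull the next seen value backward
linear = s.interpolate()           # linear interpolation between neighbours

# Every call above returned a new object; s itself still has two NaN values.
print(s.isna().sum())
```

The order-aware fills (ffill, bfill, interpolate) only make sense when row order carries meaning, as in a time series.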

Dataset

Name  Marks  Attendance
A     85     92
B     NaN    88
C     74     NaN
D     NaN    79

Two of four Marks values and one of four Attendance values are missing.

Procedure

  1. Import pandas and NumPy; create the frame df using np.nan for the gaps.
  2. Print df — pandas renders the holes as NaN; optionally count them with df.isnull().sum().
  3. Apply filled_df = df.fillna(0) and print the result; confirm that df itself still contains NaN (fillna returned a copy).
  4. Compare summaries: df["Marks"].mean() (NaN-aware) versus filled_df["Marks"].mean().
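The four steps can be sketched as one short script. The names df and filled_df follow the procedure; mean_before and mean_after are illustrative names introduced here.

```python
import numpy as np
import pandas as pd

# Step 1: build the student records frame, using np.nan for the gaps.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Marks": [85, np.nan, 74, np.nan],
    "Attendance": [92, 88, np.nan, 79],
})

# Step 2: inspect the frame and count the holes per column.
print(df)
print(df.isnull().sum())

# Step 3: zero-fill into a new frame; df itself is left untouched.
filled_df = df.fillna(0)
print(filled_df)
print(df["Marks"].isnull().sum())   # the original still has its NaN values

# Step 4: compare the NaN-aware mean with the zero-filled mean.
mean_before = df["Marks"].mean()
mean_after = filled_df["Marks"].mean()
print(mean_before, mean_after)
```

With this dataset the last line prints 79.5 and 39.75.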

Interpretation of Results

Before filling, mean() skips NaN (skipna defaults to True): (85 + 74) / 2 = 79.5, an estimate based on the students actually assessed. After fillna(0): (85 + 0 + 74 + 0) / 4 = 39.75 — the class average halves because two absences were recast as zero scores. Neither number is "wrong" in itself; they answer different questions ("average of those who appeared" vs "average counting absentees as zero"). The analytical duty is to choose deliberately and disclose: any chart or report built on filled data must state the imputation rule, because downstream readers cannot see it in the output.

Common Mistakes

  1. Calling df.fillna(0) without assigning the result or passing inplace=True, then wondering why nothing changed.
  2. Zero-filling measurement columns and then reporting means or correlations: the means are biased downward and the correlations are distorted.
  3. Skipping the profiling step: df.isnull().sum() per column should precede any imputation decision.

🎯 Viva Questions

  1. What is NaN technically? An IEEE-754 floating-point sentinel; its presence upgrades integer columns to float.
  2. dropna vs fillna? Delete incomplete rows/columns versus replace the holes with values.
  3. Why did the mean fall after filling? The two fabricated zeros added nothing to the sum (still 159) while the count grew from 2 to 4.
  4. When is fillna(0) correct? When absence semantically equals zero — counts like items sold or donated.
  5. How do you fill different columns differently? Pass a dict: df.fillna({"Marks": df["Marks"].median(), "Attendance": 0}).
  6. What are MCAR, MAR, MNAR? Missing completely at random / at random given observables / not at random (value-dependent) — in increasing order of danger for naive fills.
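The dict form from question 5 can be checked on the lab dataset. This is a minimal sketch; per_column is an illustrative name, and the zero fill for Attendance only mirrors the syntax of the answer rather than recommending that rule for a measurement column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Marks": [85, np.nan, 74, np.nan],
    "Attendance": [92, 88, np.nan, 79],
})

# One fill value per column: the median for Marks, a constant for Attendance.
# Columns not named in the dict (here Name) are left as they are.
per_column = df.fillna({"Marks": df["Marks"].median(), "Attendance": 0})
print(per_column)
```

The median of the two observed marks (85 and 74) is 79.5, so students B and D receive 79.5 in Marks, and student C receives 0 in Attendance.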