Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

Privacy Policy | Terms of Service | Contact Siksha Sarovar | About Siksha Sarovar

v4.0.9 · PWA
Siksha Sarovar logo
Siksha Sarovar
Your Learning Universe

Siksha Sarovar is a free e-learning platform for coding courses, BCA university notes and competitive exam preparation. Optional Google sign-in saves your learning progress across devices.

Initializing knowledge base…
Compiling modules 0%

Practical 4: Combine DataFrames using concat(), join(), merge()

Lesson 4 of 15 in the free Data Visualisation and Analytics Lab notes on Siksha Sarovar, written by Rohit Jangra.

Aim

To combine related DataFrames using the three pandas combination tools — concat(), join() and merge() — and to understand which tool answers which analytical question.

CO Mapping: CO1, CO2

Theory

Real analyses rarely live in one table: customers sit in one file, cities in another, marks in a third. pandas offers three combination idioms with very different semantics:

  • concat() stacks DataFrames along an axis. axis=0 appends rows (same columns); axis=1 places tables side by side aligned by index, not by any key — it performs no matching logic at all.
  • join() combines on the index. It is convenient when the joining key already is the index; its default is how="left".
  • merge() is the full SQL-style join on columns: how="inner" keeps only keys present in both tables, "left"/"right" keep one side complete (NaN where the other side is missing), and "outer" keeps everything.

Choosing among them is really choosing a join semantics: What is the key? Which side's rows must survive? What should happen to unmatched keys? A second silent hazard is cardinality: joining on a duplicated key multiplies rows (1:m or m:m "fan-out"), which quietly inflates any sum computed afterwards.

Dataset

Three deliberately imperfect tables:

df1 — students

IDName
1Aman
2Riya
3Kabir

df2 — cities (ID 1 missing, ID 4 extra)

IDCity
2Delhi
3Noida
4Pune

df3 — marks (no ID column at all)

Marks
88
91
79

Procedure

  1. Build df1, df2 and df3 and print the first two.
  2. Run pd.concat([df1, df3], axis=1) — Marks are glued beside df1 purely by row position (index 0, 1, 2).
  3. Run df1.set_index("ID").join(df2.set_index("ID"), how="left") — an index join keeping every row of df1.
  4. Run pd.merge(df1, df2, on="ID", how="inner") — a column join keeping only IDs 2 and 3.
  5. Compare the row counts and NaN patterns of the three outputs against the input tables.

Interpretation of Results

Three different answers emerge from the same tables — because three different questions were asked. concat assumes the n-th mark belongs to the n-th student: positional trust that breaks silently if either table is re-sorted. The left join keeps Aman with City = NaN — "all students, with city where known". The inner merge drops both Aman (no city) and Pune's ID 4 (no student) — "only fully matched records". An analyst must pick the semantics that matches the business question, not whichever call happens to run without error.

Common Mistakes

  1. Using concat(axis=1) on tables sorted differently — values pair up wrongly and no error is raised.
  2. Merging on keys with duplicates and not noticing the row explosion — compare len() before and after every merge.
  3. Forgetting that inner joins silently discard unmatched rows, so totals computed afterwards understate reality.

🎯 Viva Questions

  1. concat vs merge? concat stacks along an axis with index alignment; merge matches rows by key columns like SQL joins.
  2. Default join type of merge()? how="inner".
  3. When is join() preferable? When the joining key is already the index on both frames.
  4. What would an outer merge produce here? IDs 1–4, with NaN filling the missing Name/City side.
  5. What is join fan-out? Row multiplication when the key is duplicated on one or both sides (1:m, m:m).
  6. How do you append rows instead of columns? pd.concat([a, b], axis=0, ignore_index=True).