Aim
To combine related DataFrames using the three pandas combination tools — concat(), join() and merge() — and to understand which tool answers which analytical question.
CO Mapping: CO1, CO2
Theory
Real analyses rarely live in one table: customers sit in one file, cities in another, marks in a third. pandas offers three combination idioms with very different semantics:
concat()stacks DataFrames along an axis.axis=0appends rows (same columns);axis=1places tables side by side aligned by index, not by any key — it performs no matching logic at all.join()combines on the index. It is convenient when the joining key already is the index; its default ishow="left".merge()is the full SQL-style join on columns:how="inner"keeps only keys present in both tables,"left"/"right"keep one side complete (NaN where the other side is missing), and"outer"keeps everything.
Choosing among them is really choosing a join semantics: What is the key? Which side's rows must survive? What should happen to unmatched keys? A second silent hazard is cardinality: joining on a duplicated key multiplies rows (1:m or m:m "fan-out"), which quietly inflates any sum computed afterwards.
Dataset
Three deliberately imperfect tables:
df1 — students
| ID | Name |
|---|---|
| 1 | Aman |
| 2 | Riya |
| 3 | Kabir |
df2 — cities (ID 1 missing, ID 4 extra)
| ID | City |
|---|---|
| 2 | Delhi |
| 3 | Noida |
| 4 | Pune |
df3 — marks (no ID column at all)
| Marks |
|---|
| 88 |
| 91 |
| 79 |
Procedure
- Build
df1,df2anddf3and print the first two. - Run
pd.concat([df1, df3], axis=1)— Marks are glued beside df1 purely by row position (index 0, 1, 2). - Run
df1.set_index("ID").join(df2.set_index("ID"), how="left")— an index join keeping every row of df1. - Run
pd.merge(df1, df2, on="ID", how="inner")— a column join keeping only IDs 2 and 3. - Compare the row counts and NaN patterns of the three outputs against the input tables.
Interpretation of Results
Three different answers emerge from the same tables — because three different questions were asked. concat assumes the n-th mark belongs to the n-th student: positional trust that breaks silently if either table is re-sorted. The left join keeps Aman with City = NaN — "all students, with city where known". The inner merge drops both Aman (no city) and Pune's ID 4 (no student) — "only fully matched records". An analyst must pick the semantics that matches the business question, not whichever call happens to run without error.
Common Mistakes
- Using
concat(axis=1)on tables sorted differently — values pair up wrongly and no error is raised. - Merging on keys with duplicates and not noticing the row explosion — compare
len()before and after every merge. - Forgetting that inner joins silently discard unmatched rows, so totals computed afterwards understate reality.
🎯 Viva Questions
- concat vs merge? concat stacks along an axis with index alignment; merge matches rows by key columns like SQL joins.
- Default join type of merge()?
how="inner". - When is join() preferable? When the joining key is already the index on both frames.
- What would an outer merge produce here? IDs 1–4, with NaN filling the missing Name/City side.
- What is join fan-out? Row multiplication when the key is duplicated on one or both sides (1:m, m:m).
- How do you append rows instead of columns?
pd.concat([a, b], axis=0, ignore_index=True).