Pandas: Data Manipulation & Analysis
Definition: Pandas is the most important library for data manipulation and analysis in Python.It provides two primary data structures — Series (1D) and DataFrame (2D) — that make working with structured / tabular data intuitive and powerful.
import pandas as pd
---
Why Pandas?
- Load data from CSV, Excel, JSON, SQL, and more.
- Clean, filter, transform, and aggregate data effortlessly.
- Handle missing values, duplicates, and data type conversions.
- Perform group-by operations and merge/join datasets.
- Integrates seamlessly with NumPy, Matplotlib, and Scikit-learn.
---
Core Data Structures
1. Series (1D)
A labeled, one-dimensional array. Like a single column of a spreadsheet.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
# a 10
# b 20
# c 30
2. DataFrame (2D)
A labeled, two-dimensional table. Like a spreadsheet or SQL table.
df = pd.DataFrame({
'Name': ['Rahul', 'Priya', 'Amit'],
'Age': [21, 22, 23],
'Score': [85, 92, 78]
})
| Feature | Series | DataFrame |
|---|---|---|
| Dimensions | 1D | 2D |
| Structure | Single column | Multiple columns |
| Analogy | A column in Excel | An Excel sheet |
---
Reading & Writing Data
| Format | Read | Write |
|---|---|---|
| CSV | pd.read_csv("file.csv") | df.to_csv("out.csv") |
| Excel | pd.read_excel("file.xlsx") | df.to_excel("out.xlsx") |
| JSON | pd.read_json("file.json") | df.to_json("out.json") |
| SQL | pd.read_sql(query, conn) | df.to_sql("table", conn) |
| HTML | pd.read_html(url) | df.to_html("out.html") |
---
Exploring Data
| Method | Description |
|---|---|
df.head() | First 5 rows |
df.tail() | Last 5 rows |
df.shape | (rows, columns) |
df.info() | Column names, types, non-null counts |
df.describe() | Statistical summary (mean, std, min, max, quartiles) |
df.dtypes | Data types of each column |
df.columns | Column names |
df.isnull().sum() | Count of missing values per column |
---
Selecting Data
| Operation | Code | Description |
|---|---|---|
| Single column | df['Name'] | Returns a Series |
| Multiple columns | df[['Name', 'Age']] | Returns a DataFrame |
| Row by index | df.iloc[0] | First row (integer-based) |
| Row by label | df.loc[0] | Row with label 0 |
| Slicing | df.iloc[0:3] | First 3 rows |
| Condition | df[df['Age'] > 21] | Filter rows where Age > 21 |
| Multiple conditions | df[(df['Age'] > 20) & (df['Score'] > 80)] | AND condition |
---
Handling Missing Data
| Method | Code | Description |
|---|---|---|
| Detect | df.isnull() | Boolean mask of nulls |
| Count | df.isnull().sum() | Count nulls per column |
| Drop rows | df.dropna() | Remove rows with any null |
| Fill with value | df.fillna(0) | Replace nulls with 0 |
| Fill with mean | df['col'].fillna(df['col'].mean()) | Impute with column mean |
| Forward fill | df.fillna(method='ffill') | Carry forward last value |
---
Data Transformation
| Operation | Code |
|---|---|
| Add column | df['New'] = df['Score'] * 2 |
| Rename columns | df.rename(columns={'Old': 'New'}) |
| Drop column | df.drop('Col', axis=1) |
| Drop row | df.drop(0, axis=0) |
| Sort by column | df.sort_values('Score', ascending=False) |
| Apply function | df['Col'].apply(lambda x: x * 2) |
| Replace values | df['Col'].replace({'old': 'new'}) |
| Change type | df['Age'] = df['Age'].astype(float) |
---
Grouping & Aggregation
GroupBy splits data into groups, applies a function, and combines results.
df.groupby('Department')['Salary'].mean()
df.groupby('City').agg({'Sales': 'sum', 'Profit': 'mean'})
| Aggregation | Function |
|---|---|
| Sum | .sum() |
| Mean | .mean() |
| Count | .count() |
| Min/Max | .min(), .max() |
| Standard Deviation | .std() |
---
Merging & Joining DataFrames
| Method | Description | SQL Equivalent |
|---|---|---|
pd.merge(df1, df2, on='key') | Merge on common column | JOIN |
pd.merge(df1, df2, how='left') | Keep all rows from left | LEFT JOIN |
pd.merge(df1, df2, how='outer') | Keep all rows from both | FULL OUTER JOIN |
pd.concat([df1, df2]) | Stack vertically | UNION |
pd.concat([df1, df2], axis=1) | Stack horizontally | — |
Summary
- Pandas is the go-to library for data manipulation in Python.
- Series (1D) and DataFrame (2D) are its core data structures.
- It can read/write CSV, Excel, JSON, SQL, and more.
- Provides powerful tools for filtering, grouping, merging, and transforming data.
- Handling missing data (
isnull,fillna,dropna) is critical for real-world datasets.