Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

Pandas: Data Manipulation

Lesson 26 of 37 in the free Data Science notes on Siksha Sarovar, written by Rohit Jangra.

Pandas: Data Manipulation & Analysis

Definition: Pandas is the core Python library for data manipulation and analysis. It provides two primary data structures, Series (1D) and DataFrame (2D), that make working with structured (tabular) data intuitive and powerful.

import pandas as pd

---

Why Pandas?

  • Load data from CSV, Excel, JSON, SQL, and more.
  • Clean, filter, transform, and aggregate data effortlessly.
  • Handle missing values, duplicates, and data type conversions.
  • Perform group-by operations and merge/join datasets.
  • Integrates seamlessly with NumPy, Matplotlib, and Scikit-learn.

---

Core Data Structures

1. Series (1D)

A labeled, one-dimensional array, like a single column of a spreadsheet.

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
# a    10
# b    20
# c    30

2. DataFrame (2D)

A labeled, two-dimensional table, like a spreadsheet or an SQL table.

df = pd.DataFrame({
    'Name': ['Rahul', 'Priya', 'Amit'],
    'Age': [21, 22, 23],
    'Score': [85, 92, 78]
})

| Feature | Series | DataFrame |
| --- | --- | --- |
| Dimensions | 1D | 2D |
| Structure | Single column | Multiple columns |
| Analogy | A column in Excel | An Excel sheet |
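
To make the relationship between the two structures concrete, here is a minimal sketch that reuses the example data from this lesson. It shows that selecting one column of a DataFrame gives back a Series; the variable name `scores` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul', 'Priya', 'Amit'],
    'Age': [21, 22, 23],
    'Score': [85, 92, 78]
})

# Selecting one column yields a Series that shares the DataFrame's row index.
scores = df['Score']
print(type(scores).__name__)   # Series
print(scores.ndim, df.ndim)    # 1 2

# A Series with custom labels supports label-based lookup.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])                  # 20
```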

---

Reading & Writing Data

| Format | Read | Write |
| --- | --- | --- |
| CSV | pd.read_csv("file.csv") | df.to_csv("out.csv") |
| Excel | pd.read_excel("file.xlsx") | df.to_excel("out.xlsx") |
| JSON | pd.read_json("file.json") | df.to_json("out.json") |
| SQL | pd.read_sql(query, conn) | df.to_sql("table", conn) |
| HTML | pd.read_html(url) (returns a list of DataFrames) | df.to_html("out.html") |
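
The file names in the table are placeholders. As a rough illustration of a write-then-read round trip, the sketch below uses an in-memory text buffer instead of a real file, so it runs without touching the disk. The `buffer` and `restored` names are assumptions made for this example.

```python
import io
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul', 'Priya', 'Amit'],
    'Age': [21, 22, 23],
    'Score': [85, 92, 78]
})

# Write CSV text into an in-memory buffer.
# index=False leaves the row index out of the output.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the same text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.shape)            # (3, 3)
print(list(restored.columns))    # ['Name', 'Age', 'Score']
```

Passing a path such as "out.csv" in place of the buffer would write to disk in the same way.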

---

Exploring Data

| Method | Description |
| --- | --- |
| df.head() | First 5 rows |
| df.tail() | Last 5 rows |
| df.shape | (rows, columns) |
| df.info() | Column names, types, non-null counts |
| df.describe() | Statistical summary (mean, std, min, max, quartiles) |
| df.dtypes | Data types of each column |
| df.columns | Column names |
| df.isnull().sum() | Count of missing values per column |
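
A short sketch of these inspection calls on the lesson's example DataFrame may help. The `summary` and `missing` names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul', 'Priya', 'Amit'],
    'Age': [21, 22, 23],
    'Score': [85, 92, 78]
})

print(df.shape)                       # (3, 3)
print(list(df.columns))               # ['Name', 'Age', 'Score']
print(df.head(2))                     # first two rows

# describe() summarises the numeric columns by default, so 'Name' is left out.
summary = df.describe()
print(summary.loc['mean', 'Score'])   # 85.0

# isnull().sum() counts the missing values in each column.
missing = df.isnull().sum()
print(int(missing.sum()))             # 0
```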

---

Selecting Data

| Operation | Code | Description |
| --- | --- | --- |
| Single column | df['Name'] | Returns a Series |
| Multiple columns | df[['Name', 'Age']] | Returns a DataFrame |
| Row by index | df.iloc[0] | First row (integer-based) |
| Row by label | df.loc[0] | Row with label 0 |
| Slicing | df.iloc[0:3] | First 3 rows |
| Condition | df[df['Age'] > 21] | Filter rows where Age > 21 |
| Multiple conditions | df[(df['Age'] > 20) & (df['Score'] > 80)] | AND condition |
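
With the default integer index, iloc[0] and loc[0] return the same row, which can hide the difference between them. The sketch below assumes hypothetical string labels ('r1', 'r2', 'r3') to show that iloc works by position and loc by label.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Name': ['Rahul', 'Priya', 'Amit'],
        'Age': [21, 22, 23],
        'Score': [85, 92, 78],
    },
    index=['r1', 'r2', 'r3'],  # hypothetical row labels for illustration
)

first_by_position = df.iloc[0]    # first row, whatever its label is
first_by_label = df.loc['r1']     # the row labelled 'r1'

# Boolean filtering: each condition needs its own parentheses when combined with &.
older = df[df['Age'] > 21]
both = df[(df['Age'] > 20) & (df['Score'] > 80)]

print(older['Name'].tolist())     # ['Priya', 'Amit']
print(both['Name'].tolist())      # ['Rahul', 'Priya']

# iloc slices exclude the end position; loc slices include the end label.
print(len(df.iloc[0:2]))          # 2
print(len(df.loc['r1':'r2']))     # 2
```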

---

Handling Missing Data

| Method | Code | Description |
| --- | --- | --- |
| Detect | df.isnull() | Boolean mask of nulls |
| Count | df.isnull().sum() | Count nulls per column |
| Drop rows | df.dropna() | Remove rows with any null |
| Fill with value | df.fillna(0) | Replace nulls with 0 |
| Fill with mean | df['col'].fillna(df['col'].mean()) | Impute with column mean |
| Forward fill | df.ffill() | Carry forward last value (fillna(method='ffill') is deprecated in recent pandas versions) |
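
The following sketch applies these methods to a small DataFrame with one missing score. The data is invented for illustration; it is not from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Priya's score is missing.
df = pd.DataFrame({
    'Name': ['Rahul', 'Priya', 'Amit'],
    'Score': [85.0, np.nan, 78.0]
})

print(int(df['Score'].isnull().sum()))   # 1

# dropna() returns a new DataFrame without the incomplete row.
dropped = df.dropna()
print(len(dropped))                      # 2

# Mean imputation: the mean ignores NaN, so it is (85 + 78) / 2 = 81.5.
mean_filled = df['Score'].fillna(df['Score'].mean())
print(mean_filled.tolist())              # [85.0, 81.5, 78.0]

# Forward fill copies the previous valid value down.
forward = df['Score'].ffill()
print(forward.tolist())                  # [85.0, 85.0, 78.0]
```

None of these calls modify df in place; each returns a new object, so the result has to be assigned if it is to be kept.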

---

Data Transformation

| Operation | Code |
| --- | --- |
| Add column | df['New'] = df['Score'] * 2 |
| Rename columns | df.rename(columns={'Old': 'New'}) |
| Drop column | df.drop('Col', axis=1) |
| Drop row | df.drop(0, axis=0) |
| Sort by column | df.sort_values('Score', ascending=False) |
| Apply function | df['Col'].apply(lambda x: x * 2) |
| Replace values | df['Col'].replace({'old': 'new'}) |
| Change type | df['Age'] = df['Age'].astype(float) |
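
One point the table does not spell out is that rename, drop, sort_values, apply and replace return new objects rather than changing df. The sketch below, using the lesson's example data with illustrative column names ('Double', 'Marks'), shows this alongside the assignments that do modify df.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul', 'Priya', 'Amit'],
    'Age': [21, 22, 23],
    'Score': [85, 92, 78]
})

# Column assignment modifies df directly.
df['Double'] = df['Score'] * 2

# rename() returns a new DataFrame; df keeps its original column name.
renamed = df.rename(columns={'Score': 'Marks'})
print('Score' in df.columns, 'Marks' in renamed.columns)   # True True

# sort_values() also returns a new, sorted DataFrame.
ranked = df.sort_values('Score', ascending=False)
print(ranked['Name'].tolist())     # ['Priya', 'Rahul', 'Amit']

# apply() runs a function on every value of the column.
grades = df['Score'].apply(lambda x: 'A' if x >= 80 else 'B')
print(grades.tolist())             # ['A', 'A', 'B']

# Type conversion takes effect only because the result is assigned back.
df['Age'] = df['Age'].astype(float)
print(df['Age'].dtype)             # float64
```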

---

Grouping & Aggregation

GroupBy splits data into groups, applies a function, and combines results.

df.groupby('Department')['Salary'].mean()
df.groupby('City').agg({'Sales': 'sum', 'Profit': 'mean'})

| Aggregation | Function |
| --- | --- |
| Sum | .sum() |
| Mean | .mean() |
| Count | .count() |
| Min/Max | .min(), .max() |
| Standard Deviation | .std() |
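
The split, apply, combine steps can be traced on a small table. The `sales` DataFrame and its figures below are made up for illustration, chosen so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical sales records: two rows per city.
sales = pd.DataFrame({
    'City': ['Delhi', 'Delhi', 'Mumbai', 'Mumbai'],
    'Sales': [100, 200, 150, 50],
    'Profit': [10, 30, 20, 40],
})

# Split by City, take the mean of Sales in each group, combine into a Series.
mean_sales = sales.groupby('City')['Sales'].mean()
print(mean_sales['Delhi'], mean_sales['Mumbai'])    # 150.0 100.0

# agg() applies a different aggregation to each column.
summary = sales.groupby('City').agg({'Sales': 'sum', 'Profit': 'mean'})
print(summary.loc['Delhi', 'Sales'])     # 300
print(summary.loc['Mumbai', 'Profit'])   # 30.0
```

The group keys (here the city names) become the index of the result, which is why the values are looked up with loc.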

---

Merging & Joining DataFrames

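| Method | Description | SQL Equivalent |
| --- | --- | --- |
| pd.merge(df1, df2, on='key') | Merge on common column, keeping only matching keys (default how='inner') | INNER JOIN |
| pd.merge(df1, df2, how='left') | Keep all rows from left | LEFT JOIN |
| pd.merge(df1, df2, how='outer') | Keep all rows from both | FULL OUTER JOIN |
| pd.concat([df1, df2]) | Stack vertically (duplicates are kept) | UNION ALL |
| pd.concat([df1, df2], axis=1) | Stack horizontally | None |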
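
To see how the join types differ, the sketch below merges two small hypothetical tables, `students` and `scores`, whose keys only partly overlap (1, 2, 3 versus 2, 3, 4).

```python
import pandas as pd

# Hypothetical tables sharing a 'key' column.
students = pd.DataFrame({'key': [1, 2, 3], 'Name': ['Rahul', 'Priya', 'Amit']})
scores = pd.DataFrame({'key': [2, 3, 4], 'Score': [92, 78, 60]})

# Inner join (the default): only keys present in both tables.
inner = pd.merge(students, scores, on='key')
print(inner['key'].tolist())              # [2, 3]

# Left join: every student is kept; the missing score becomes NaN.
left = pd.merge(students, scores, on='key', how='left')
print(int(left['Score'].isnull().sum()))  # 1

# Outer join: every key from either table.
outer = pd.merge(students, scores, on='key', how='outer')
print(sorted(outer['key'].tolist()))      # [1, 2, 3, 4]

# concat stacks rows and keeps duplicates; ignore_index renumbers the rows.
stacked = pd.concat([students, students], ignore_index=True)
print(len(stacked))                       # 6
```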

---

Summary

  • Pandas is the go-to library for data manipulation in Python.
  • Series (1D) and DataFrame (2D) are its core data structures.
  • It can read/write CSV, Excel, JSON, SQL, and more.
  • Provides powerful tools for filtering, grouping, merging, and transforming data.
  • Handling missing data (isnull, fillna, dropna) is critical for real-world datasets.