Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.


Scikit-learn: Machine Learning

Lesson 29 of 37 in the free Data Science notes on Siksha Sarovar, written by Rohit Jangra.

Scikit-learn: Machine Learning in Python

Definition: Scikit-learn (imported as sklearn) is one of the most widely used machine learning libraries in Python. It provides simple and efficient tools for data mining, data analysis, and machine learning, including classification, regression, clustering, dimensionality reduction, and model evaluation.

import sklearn

---

Why Scikit-learn?

| Feature | Benefit |
|---|---|
| Simple API | Consistent interface: fit(), predict(), transform() |
| Wide Coverage | Covers most classical ML algorithms |
| Well-Documented | Excellent documentation with examples |
| Integration | Works with NumPy, Pandas, Matplotlib |
| Production Ready | Used in industry and academia |

---

The Scikit-learn Workflow

1. Import Data → 2. Preprocess → 3. Split (Train/Test)
→ 4. Choose Model → 5. Train (fit) → 6. Predict
→ 7. Evaluate → 8. Tune Hyperparameters
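
The eight steps above can be sketched end to end as follows. This is a minimal sketch, not a prescribed recipe: the bundled iris dataset, the k-nearest-neighbours classifier, and the small grid of `n_neighbors` values are all assumptions chosen for illustration. Scaling is placed inside a pipeline so that it is learned from the training data only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: import data (iris is an illustrative choice)
X, y = load_iris(return_X_y=True)

# Step 3: split first, so the test set stays unseen during preprocessing and tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 2 and 4: preprocessing and model chained in one pipeline
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Step 8: tune a hyperparameter with cross-validated grid search
search = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7]}, cv=5)

# Step 5: train (fits every candidate, then refits the best one)
search.fit(X_train, y_train)

# Step 6: predict on held-out data
y_pred = search.predict(X_test)

# Step 7: evaluate (accuracy, because the final estimator is a classifier)
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

In practice, tuning (step 8) happens before the final evaluation on the test set, which is why the grid search here only ever sees the training split.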

---

Categories of ML Algorithms in Scikit-learn

| Category | What It Does | Algorithms | Use Case |
|---|---|---|---|
| Supervised - Classification | Predicts categories | Logistic Regression, Decision Tree, Random Forest, SVM, KNN | Spam detection, disease diagnosis |
| Supervised - Regression | Predicts continuous values | Linear Regression, Ridge, Lasso, Decision Tree Regressor | House price prediction, sales forecast |
| Unsupervised - Clustering | Groups similar data | K-Means, DBSCAN, Hierarchical | Customer segmentation |
| Dimensionality Reduction | Reduces features | PCA and t-SNE (unsupervised), LDA (Linear Discriminant Analysis, which uses class labels) | Visualization, feature reduction |
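
The unsupervised rows of the table can be illustrated with a short sketch. It assumes synthetic data from `make_blobs` (three well-separated groups in four dimensions), which is an illustrative choice: K-Means assigns each sample to a cluster without using any labels, and PCA projects the four features down to two for plotting.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic, unlabeled data: 150 samples, 4 features, 3 hidden groups (assumed setup)
X, _ = make_blobs(
    n_samples=150, centers=3, n_features=4, cluster_std=0.5, random_state=42
)

# Clustering: K must be chosen up front
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Dimensionality reduction: keep the 2 directions with the most variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("Cluster sizes:", [int((labels == k).sum()) for k in range(3)])
print("Variance kept by 2 components:", round(pca.explained_variance_ratio_.sum(), 3))
```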

---

Data Preprocessing

Train-Test Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
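
A self-contained version of the split is sketched below on a tiny array invented for illustration. `test_size=0.2` holds out 20% of the rows, `random_state` makes the shuffle reproducible, and the optional `stratify=y` argument (not used in the snippet above) keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples, 2 features, two balanced classes
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train shape:", X_train.shape)  # 8 of the 10 rows
print("Test shape:", X_test.shape)    # 2 of the 10 rows
print("Test labels:", sorted(y_test.tolist()))  # one sample from each class
```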

Feature Scaling

| Method | Class | When to Use |
|---|---|---|
| Standardization | StandardScaler() | When data is normally distributed |
| Normalization | MinMaxScaler() | When you need values in [0, 1] |
| Robust Scaling | RobustScaler() | When data has outliers |

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
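
To see how the three scalers differ, the sketch below applies each one to a single made-up column that contains an outlier (100). The data is an assumption for illustration only. Note how the outlier squeezes the other values toward 0 under min-max scaling, while robust scaling, which uses the median and interquartile range, keeps them evenly spread.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with an outlier (hypothetical values)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {
    "standard": StandardScaler().fit_transform(X),  # (x - mean) / std
    "minmax": MinMaxScaler().fit_transform(X),      # (x - min) / (max - min)
    "robust": RobustScaler().fit_transform(X),      # (x - median) / IQR
}

for name, values in scaled.items():
    print(f"{name:>8}: {np.round(values.ravel(), 2)}")
```

As in the snippet above, a scaler should be fitted on the training data and then only applied (transform) to the test data, so that no information from the test set leaks into preprocessing.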

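Encoding Categorical Data (Categorical → Numeric)

| Method | Class | Example |
|---|---|---|
| Label Encoding | LabelEncoder() | Female → 0, Male → 1 (classes are sorted alphabetically; intended for target labels) |
| One-Hot Encoding | OneHotEncoder() | With categories Blue, Green, Red: Green → [0,1,0], Red → [0,0,1] |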
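
A small sketch of both encoders follows, using invented category values. Both encoders order the categories alphabetically, so the integer or column assigned to a value depends on that sorted order, not on the order in which values appear.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Label encoding: one integer per class (typically used for the target column)
gender = ["Male", "Female", "Female", "Male"]
le = LabelEncoder()
codes = le.fit_transform(gender)
print("Classes:", list(le.classes_))
print("Codes:", codes.tolist())

# One-hot encoding: one 0/1 column per category (expects a 2-D input)
colors = np.array([["Red"], ["Green"], ["Blue"]])
ohe = OneHotEncoder()
onehot = ohe.fit_transform(colors).toarray()  # result is sparse by default
print("Columns:", list(ohe.categories_[0]))
print(onehot)
```

One-hot encoding avoids implying an order between categories, which matters for models that treat numeric inputs as magnitudes.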

---

The Universal Scikit-learn API

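Supervised Scikit-learn estimators all follow the same pattern. The example below uses LogisticRegression, but any other estimator class can be swapped in without changing the remaining steps:

from sklearn.linear_model import LogisticRegression

# 1. Create the model, passing hyperparameters to the constructor
model = LogisticRegression(max_iter=1000)

# 2. Train on the training split
model.fit(X_train, y_train)

# 3. Predict on unseen data
y_pred = model.predict(X_test)

# 4. Evaluate (score() returns accuracy for classifiers, R² for regressors)
score = model.score(X_test, y_test)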
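
Because the interface is shared, trying a different algorithm means changing only the constructor. The sketch below loops over three classifiers with identical fit and score calls; the iris dataset and the particular models are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Three different algorithms, one identical workflow
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)                        # train
    results[name] = model.score(X_test, y_test)        # accuracy on the test split
    print(f"{name}: {results[name]:.3f}")
```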

---

Key Algorithms at a Glance

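| Algorithm | Type | Pros | Cons |
|---|---|---|---|
| Linear Regression | Regression | Simple, interpretable | Assumes linearity |
| Logistic Regression | Classification | Fast, good baseline | Linear decision boundary |
| Decision Tree | Classification and regression | Easy to visualize | Overfits easily |
| Random Forest | Classification and regression | Reduces overfitting, versatile | Slower, less interpretable |
| SVM | Classification and regression | Effective in high dimensions | Slow on large datasets |
| KNN | Classification and regression | Almost no training cost (it stores the data) | Slow prediction, memory heavy |
| K-Means | Clustering | Simple, scalable | Need to specify K |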
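
The "overfits easily" entry for decision trees can be observed directly. The sketch below assumes a synthetic dataset with deliberately noisy labels (`flip_y=0.2`), an illustrative setup rather than a benchmark. An unrestricted tree memorizes the training set, so its training accuracy is perfect while its test accuracy drops; a random forest averages many trees and often, though not always, generalizes better.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels (assumed setup)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

tree_train = tree.score(X_train, y_train)
tree_test = tree.score(X_test, y_test)
forest_test = forest.score(X_test, y_test)

# A large gap between train and test accuracy is the signature of overfitting
print(f"Decision tree  train={tree_train:.3f}  test={tree_test:.3f}")
print(f"Random forest  test={forest_test:.3f}")
```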

---

Model Evaluation

Classification Metrics

| Metric | Description | When to Use |
|---|---|---|
| Accuracy | Correct predictions / Total | Balanced classes |
| Precision | TP / (TP + FP) | Minimize false positives (spam detection) |
| Recall | TP / (TP + FN) | Minimize false negatives (disease detection) |
| F1 Score | Harmonic mean of Precision & Recall | Imbalanced classes |
| Confusion Matrix | TP, TN, FP, FN table | Detailed error analysis |
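
The formulas in the table can be checked on a small hand-made example. The labels below are invented for illustration and give TP = 3, FN = 1, FP = 2, TN = 4, so accuracy is 7/10, precision is 3/5, and recall is 3/4.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical ground truth and predictions (1 = positive class)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)     # (TP + TN) / total = 7 / 10
prec = precision_score(y_true, y_pred)   # TP / (TP + FP) = 3 / 5
rec = recall_score(y_true, y_pred)       # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)            # 2 * prec * rec / (prec + rec)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.3f}")
print(cm)
```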

Regression Metrics

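| Metric | Description |
|---|---|
| MAE (Mean Absolute Error) | Average absolute difference |
| MSE (Mean Squared Error) | Average squared difference |
| RMSE (Root MSE) | Square root of MSE (same unit as target) |
| R² Score | Proportion of variance explained (1 is a perfect fit; negative when the model is worse than always predicting the mean) |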
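
A short worked example of the four regression metrics, using made-up numbers. The errors are -1, 0, 1 and 2, so MAE = 4/4 = 1.0 and MSE = 6/4 = 1.5. The second set of predictions is deliberately poor, to show that R² can fall below zero.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions
y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 11]

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 1 + 2) / 4
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 1 + 4) / 4
rmse = np.sqrt(mse)                        # back in the unit of the target
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot = 1 - 6 / 20

# Predictions that are worse than always guessing the mean (6)
r2_bad = r2_score(y_true, [9, 7, 5, 3])    # 1 - 80 / 20

print(f"MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.3f} R2={r2:.2f} R2_bad={r2_bad:.2f}")
```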

---

Cross-Validation

Instead of a single train-test split, cross-validation uses multiple splits for more reliable evaluation:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Mean Accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
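
A self-contained version of the call above is sketched here. The dataset (iris), the model (scaling followed by logistic regression) and the shuffled stratified folds are assumptions made for illustration. Wrapping the scaler and the model in one pipeline means the scaler is refitted on the training part of every fold, so the held-out part never influences preprocessing.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Preprocessing and model evaluated together as one estimator
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5 folds: each sample is used for testing exactly once
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print(f"Mean Accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
```

The standard deviation across folds gives a rough sense of how much the score depends on which samples happen to land in the test part.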

Summary

  • Scikit-learn provides a consistent API across its estimators: fit(), predict(), score().
  • Preprocessing (scaling, encoding, splitting) is critical before training.
  • Classification, regression, clustering, and dimensionality reduction are all supported.
  • Model evaluation metrics (accuracy, precision, recall, F1, R²) guide model selection.
  • Cross-validation provides more reliable performance estimates than a single split.