Machine Learning II — Free Notes & Tutorial
Free Advanced Machine Learning (ML2) notes for BCA — deep learning, ensemble methods, NLP at SikshaSarovar.
This Machine Learning II course is part of Siksha Sarovar and is 100% free for students in India — no sign-up required to read. It contains 22 structured lessons with examples, and pairs with our free online compiler and AI tutor.
What you will learn
- Deep learning
- Ensemble methods
- NLP
- Advanced ML
Course content (22 lessons)
- Unit I Overview: Combining Different Models — Ensemble methods combine multiple base learners to produce a more powerful predictor. Condorcet's Jury Theorem (1785) shows that when…
- 1. Evaluating ML Algorithms & Model Selection — Model evaluation is the foundation of trustworthy ML. Without rigorous evaluation, we cannot distinguish genuine generalization…
- 2. Introduction to Statistical Learning Theory — Statistical Learning Theory (SLT) provides mathematical foundations for understanding when and why ML generalizes. Developed by…
- 3. Ensemble Methods: Boosting — Boosting is a sequential ensemble method that converts many weak learners (models slightly better than random guessing) into a powerful strong…
- 4. Ensemble Methods: Bagging — Bagging (Bootstrap AGGregatING), introduced by Breiman (1994), is a parallel ensemble method that reduces variance by training multiple models on…
- 5. Ensemble Methods: Random Forests — Random Forests, introduced by Breiman (2001), extend bagging by adding random feature selection at each split, further decorrelating base trees…
- Unit II Overview: Dimensionality Reduction — High-dimensional data poses fundamental computational and statistical challenges. Dimensionality reduction methods find compact,…
- 6. Linear Discriminant Analysis (LDA) — Linear Discriminant Analysis was introduced by R.A. Fisher in 1936 as a method to find the linear combination of features that best separates…
- 7. Principal Component Analysis (PCA) — PCA is the most widely used unsupervised dimensionality reduction technique. Developed by Pearson (1901) and Hotelling (1933), PCA finds an…
- 8. Kernel PCA — Kernel PCA extends PCA to non-linearly separable data by first mapping inputs to a high-dimensional feature space using a kernel function, then applying PCA in that…
- 9. Factor Analysis — Factor Analysis (FA) is a probabilistic generative model that explains observed variables as linear combinations of a small number of latent factors plus unique…
- 10. Independent Component Analysis (ICA) — ICA is a computational technique for separating a multivariate signal into statistically independent non-Gaussian components. The classic…
- Unit III Overview: Learning With Neural Networks — Artificial Neural Networks (ANNs) are inspired by the biological neural networks in animal brains. The modern deep learning…
- 11. The Perceptron — The Perceptron, invented by Frank Rosenblatt in 1957, was the first algorithmically described neural network. It sparked enormous optimism but was later shown…
- 12. Multilayer Neural Networks & Backpropagation — Multilayer Perceptrons (MLPs) overcome the linearity limitation of single perceptrons by stacking layers of neurons with…
- 13. Learning Neural Network Structures — The architecture and regularization of a neural network are just as important as the training algorithm itself. This lesson covers practical…
- 14. Deep Learning & Feature Representation Learning — Deep learning is characterized by learning hierarchical feature representations directly from raw data. Rather than…
- Unit IV Overview: Reinforcement Learning — Reinforcement Learning (RL) addresses the problem of learning to act in an environment to maximize cumulative reward. Unlike supervised…
- 15. Elements of Reinforcement Learning — This lesson formalizes the key components of the RL framework: policies, value functions, and the role of the discount factor in shaping…
- 16. Generalization in Reinforcement Learning — Generalization in RL is the ability to perform well on states not seen during training. Deep RL extends the Q-function and policy to…
- 17. Policy Search — Policy search methods directly optimize the policy parameters without necessarily learning a value function. They are particularly powerful for continuous action…
- 18. Adaptive Dynamic Programming — Adaptive Dynamic Programming (ADP) bridges dynamic programming (which requires a model) and model-free RL (which learns from experience). It…
Unit I Overview: Combining Different Models
Ensemble methods combine multiple base learners to produce a more powerful predictor. Condorcet's Jury Theorem (1785) shows that when voters are independent and each is slightly better than chance, the probability that a majority vote is correct approaches 1 as the number of voters grows. Machine learning exploits this through bagging, boosting, and stacking.
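The jury-theorem claim can be checked numerically. The sketch below is illustrative only: the function name `majority_correct_probability` and the accuracy value 0.55 are assumptions chosen for the example, and the calculation relies on the theorem's idealized assumption that voters are fully independent.

```python
from math import comb


def majority_correct_probability(n_voters: int, p: float) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Each voter is correct with probability p. n_voters must be odd so that
    ties cannot occur.
    """
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    k_min = n_voters // 2 + 1  # smallest number of correct votes that wins
    # Sum the binomial probabilities of k_min, k_min + 1, ..., n correct votes.
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(k_min, n_voters + 1)
    )


if __name__ == "__main__":
    # Hypothetical voters that are each right 55% of the time.
    for n in (1, 11, 101, 501):
        print(n, round(majority_correct_probability(n, 0.55), 4))
```

Real base learners are trained on overlapping data and are rarely independent, so the gain in practice is smaller than this idealized calculation suggests.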
Why Ensembles Work
The expected test error decomposes as: Total Error = Bias^2 + Variance + Irreducible Noise
- Bagging trains models in parallel on bootstrap samples — reduces variance by averaging.
- Boosting trains models sequentially, correcting previous errors — reduces bias.
- Stacking learns a meta-model to optimally weight diverse base learners.
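The variance-reduction effect of bagging can be seen in a small simulation. This is a minimal sketch under stated assumptions, not a reference implementation: the base learner is a 1-nearest-neighbour regressor (chosen because it is unstable), the data come from a hypothetical model y = x + Gaussian noise, and all function names are made up for the example.

```python
import random
from statistics import mean, pvariance


def one_nn_predict(train, x):
    """Predict y at x with 1-nearest-neighbour, a deliberately high-variance learner."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def bagged_predict(train, x, n_models, rng):
    """Average 1-NN predictions over n_models bootstrap resamples of train."""
    preds = []
    for _ in range(n_models):
        boot = rng.choices(train, k=len(train))  # sample with replacement
        preds.append(one_nn_predict(boot, x))
    return mean(preds)


def sample_training_set(rng, n=30):
    """Hypothetical data: x uniform on [0, 1], y = x + Gaussian noise (sd 0.5)."""
    xs = [rng.random() for _ in range(n)]
    return [(x, x + rng.gauss(0, 0.5)) for x in xs]


def prediction_variance(predict, rng, repeats=200, x=0.5):
    """Variance of the prediction at x across independently drawn training sets."""
    return pvariance([predict(sample_training_set(rng), x) for _ in range(repeats)])


rng = random.Random(0)
var_single = prediction_variance(one_nn_predict, rng)
var_bagged = prediction_variance(lambda t, x: bagged_predict(t, x, 25, rng), rng)
print(f"single 1-NN variance: {var_single:.3f}")
print(f"bagged 1-NN variance: {var_bagged:.3f}")
```

The bagged predictor averages over many resampled neighbours, so its prediction at a fixed point fluctuates less from one training set to the next than a single 1-NN model does.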
Unit I Roadmap
| Topic | Core Technique | Primary Benefit |
|---|---|---|
| Model Evaluation | Cross-validation, AUC | Reliable performance estimates |
| Statistical Learning Theory | PAC learning, VC dimension | Formal generalization bounds |
| Boosting | AdaBoost, Gradient Boosting | Bias reduction |
| Bagging | Bootstrap aggregating | Variance reduction |
| Random Forests | Random subspace + bagging | Variance reduction through decorrelated trees |
No Free Lunch Theorem
No single algorithm outperforms all others on every problem distribution. This motivates ensembles: diverse models cover different hypothesis regions for broadly robust performance.
Requirements for Effective Ensembles
- Each base learner must be better than random guessing (above 50% accuracy for balanced binary classification).
- Base learners must make diverse, uncorrelated errors — diversity is paramount.
- A combination rule (vote, average, or meta-learner) must aggregate predictions.
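Two of the combination rules mentioned above, voting and averaging, can be sketched in a few lines. The function names `majority_vote` and `soft_vote` are illustrative assumptions, and tie-breaking here simply follows the order in which labels are first seen.

```python
from collections import Counter


def majority_vote(labels):
    """Hard voting: return the most common predicted label.

    Ties are broken in favour of the label that appears first.
    """
    return Counter(labels).most_common(1)[0][0]


def soft_vote(prob_rows):
    """Soft voting: average per-class probabilities across models.

    prob_rows holds one probability vector per model.
    Returns (index of the winning class, averaged probabilities).
    """
    n_models = len(prob_rows)
    n_classes = len(prob_rows[0])
    avg = [sum(row[c] for row in prob_rows) / n_models for c in range(n_classes)]
    winner = max(range(n_classes), key=lambda c: avg[c])
    return winner, avg


# Three hypothetical models scoring one sample for classes 0 and 1.
probs = [[0.9, 0.1], [0.4, 0.6], [0.4, 0.6]]
hard = majority_vote([row.index(max(row)) for row in probs])
soft, averaged = soft_vote(probs)
print("hard vote:", hard, "soft vote:", soft)
```

In this made-up example the two rules disagree: two of three models prefer class 1, but the single confident model pulls the averaged probability toward class 0. Which rule is preferable depends on how well calibrated the base learners' probabilities are.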
Exam-Ready Summary
- Ensemble error = f(individual errors, inter-model correlation)
- Lower inter-model correlation means greater variance reduction
- Bagging: parallel, best with unstable high-variance models (deep trees)
- Boosting: sequential, best with stable weak learners (decision stumps)
- Diversity is essential — identical models provide zero ensemble benefit
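The link between correlation and variance reduction can be made concrete with a standard result: for n models that each have prediction variance sigma^2 and a common pairwise correlation rho, the variance of their average is rho * sigma^2 + (1 - rho) * sigma^2 / n. The sketch below evaluates that formula; the function name and the equal-variance, equal-correlation setup are simplifying assumptions for illustration.

```python
def ensemble_variance(sigma2: float, rho: float, n_models: int) -> float:
    """Variance of the average of n_models equally correlated predictors.

    sigma2: variance of each individual model's prediction
    rho:    pairwise correlation between any two models (assumed equal)
    """
    if n_models < 1:
        raise ValueError("n_models must be at least 1")
    # The first term does not shrink with n; only the second term does.
    return rho * sigma2 + (1 - rho) * sigma2 / n_models


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        print(f"rho={rho}: {ensemble_variance(1.0, rho, 10):.3f}")
```

With rho = 1 (identical models) the variance stays at sigma^2 regardless of n, which is the "zero ensemble benefit" case; with rho = 0 it falls as 1/n.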
1. Evaluating ML Algorithms & Model Selection
Model evaluation is the foundation of trustworthy ML. Without rigorous evaluation, we cannot distinguish genuine generalization from overfitting — a model that memorizes training data but fails in production.
Bias-Variance Decomposition
Under squared-error loss, the expected generalization error decomposes as: Total Error = Bias^2 + Variance + Irreducible Noise
- Bias: Error from incorrect assumptions. High bias implies underfitting — the model is too simple.
- Variance: Error from sensitivity to training data. High variance implies overfitting — the model is too complex.
- Sweet spot: The model complexity that minimizes the sum bias^2 + variance, since lowering one typically raises the other.
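For reference, the decomposition can be written out term by term for squared-error loss. The notation is an assumption introduced here, not taken from the lesson: f is the true function, \hat{f} is the model fitted on a random training set D, and y = f(x) + \varepsilon where the noise \varepsilon has zero mean and variance \sigma^2.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\bigl(y - \hat{f}(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}_D[\hat{f}(x)] - f(x)\bigr)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\bigl(\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)]\bigr)^2\right]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible Noise}}
```

Only the first two terms depend on the model; the noise term sets a floor that no choice of complexity can remove.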
Cross-Validation Techniques
| Method | How It Works | When to Use |
|---|---|---|
| Holdout | Single 80/20 split | Large datasets, quick checks |
| k-Fold CV | Split into k folds; each fold is the validation set once | Standard practice |
| Stratified k-Fold | Preserves class proportions | Imbalanced classes |
| LOOCV | n-Fold, leave one out | Very small datasets |
| Time-Series CV | Walk-forward splits: train on the past, validate on later data | Sequential/temporal data |
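
The k-fold row of the table can be illustrated with a small index splitter. This is a minimal sketch using only the standard library; the function name `k_fold_indices` and the `seed` parameter are assumptions for the example and do not correspond to any particular library's API.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Every sample index appears in exactly one validation fold.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


if __name__ == "__main__":
    for train_idx, val_idx in k_fold_indices(10, 3):
        print(len(train_idx), "train /", len(val_idx), "validation")
```

Setting `k` equal to `n_samples` gives LOOCV. The shuffle makes this sketch unsuitable for time-series data, where the order of samples must be preserved.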
Classification Metrics
| Metric | Formula | Prefer When |
|---|---|---|
| Accuracy | (TP+TN)/Total | Balanced classes |
| Precision | TP/(TP+FP) | False positives costly |
| Recall | TP/(TP+FN) | False negatives costly |
| F1-Score | 2PR/(P+R) | Imbalanced datasets |
| AUC-ROC | Area under ROC | Threshold-independent |
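
The formulas in the table translate directly into code. The sketch below is illustrative: the function names are made up, returning 0.0 when a denominator is zero is a convention assumed here, and AUC is computed through its pairwise-ranking interpretation (the chance that a random positive outscores a random negative, counting ties as half).

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention for 0/0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos
        for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))


if __name__ == "__main__":
    # Hypothetical imbalanced problem: 20 positives among 1000 samples.
    print(classification_metrics(tp=5, fp=5, fn=15, tn=975))
    print(roc_auc([0.9, 0.8], [0.1, 0.85]))
```

In the hypothetical counts above, accuracy is 0.98 while recall is only 0.25, which shows why accuracy alone is a poor guide on imbalanced data.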
Model Selection Strategies
- Grid Search: Exhaustive sweep over hyperparameter grid.
- Random Search: Sample hyperparameters from distributions. Often reaches comparable results with far fewer trials than a full grid, especially when only a few hyperparameters matter.
- Bayesian Optimization: Build a surrogate model to focus on promising hyperparameter regions.
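The first two strategies can be compared with a short sketch. Everything here is hypothetical: `toy_objective` stands in for a cross-validated score and is constructed to peak at learning_rate = 0.1 and depth = 5, and the function names and search ranges are chosen only for illustration. Bayesian optimization is omitted because it needs a surrogate model.

```python
import itertools
import math
import random


def toy_objective(learning_rate, depth):
    """Hypothetical validation score; its maximum of 0 is at (0.1, 5)."""
    return -((math.log10(learning_rate) + 1) ** 2) - 0.1 * (depth - 5) ** 2


def grid_search(objective, grid):
    """Evaluate every combination in grid (name -> list of values)."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


def random_search(objective, samplers, n_trials, seed=0):
    """Evaluate n_trials random draws; samplers maps name -> function(rng)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {name: draw(rng) for name, draw in samplers.items()}
        score = objective(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


grid_best = grid_search(
    toy_objective, {"learning_rate": [0.01, 0.1, 1.0], "depth": [3, 5, 7]}
)
random_best = random_search(
    toy_objective,
    {
        "learning_rate": lambda rng: 10 ** rng.uniform(-3, 0),  # log-uniform
        "depth": lambda rng: rng.randint(2, 10),
    },
    n_trials=20,
)
print("grid:", grid_best)
print("random:", random_best)
```

Grid search cost grows multiplicatively with each added hyperparameter, whereas the random search budget (`n_trials`) is fixed in advance. Sampling the learning rate on a log scale is a common choice when plausible values span several orders of magnitude.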
Common Pitfalls
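- Data leakage: Preprocessing or feature engineering that uses information from the validation/test data (or from the future, in temporal data) contaminates results. Fit scalers and encoders on the training split only.
- Test set reuse: Repeatedly tuning against the same test set makes its score optimistically biased. Keep a final test set that is used only once.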
- Wrong metric: Accuracy is misleading on highly imbalanced datasets.
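A common form of data leakage is computing preprocessing statistics on the full dataset before splitting. The sketch below contrasts the two approaches for standardization; the helper `fit_standardizer` and the tiny dataset are assumptions made up for the example.

```python
from statistics import mean, pstdev


def fit_standardizer(fit_values):
    """Learn mean and standard deviation from fit_values; return a transform."""
    mu, sigma = mean(fit_values), pstdev(fit_values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return lambda values: [(v - mu) / sigma for v in values]


# Hypothetical feature values; the test point is an outlier.
train, test = [1.0, 2.0, 3.0, 4.0], [100.0]

scale = fit_standardizer(train)         # correct: statistics from train only
leaky = fit_standardizer(train + test)  # leaky: test data shapes the transform

print("train-only fit, test value ->", round(scale(test)[0], 2))
print("leaky fit, test value      ->", round(leaky(test)[0], 2))
```

With the leaky fit, the outlier has already shifted the mean and inflated the standard deviation, so it looks far less extreme than it would to a model deployed on genuinely unseen data. The same principle applies inside cross-validation: the transform is refit on the training portion of every fold.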
Exam-Ready Summary
- 5-fold or 10-fold CV is standard; LOOCV is nearly unbiased but expensive and can have high variance
- Always stratify folds for classification tasks
- High bias: increase model complexity or add features
- High variance: add data, apply regularization, reduce model complexity
- Data leakage is the most dangerous source of misleading evaluation results
Frequently asked questions
Is the Machine Learning II course really free?
Yes. The entire Machine Learning II course on Siksha Sarovar is free to read with no account required. You can optionally sign in with Google to save your progress.
Do I get a certificate for Machine Learning II?
Yes — finish the lessons and pass the quiz to earn a free, verifiable certificate you can share on LinkedIn or with recruiters.
Can I run code while learning?
Yes. The built-in online compiler runs C, C++, Python, Java, PHP, JavaScript, C# and SQL directly in your browser — no installation needed.