Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

5. Ensemble Methods: Random Forests

Lesson 6 of 22 in the free Machine Learning II notes on Siksha Sarovar, written by Rohit Jangra.

Random Forests, introduced by Breiman (2001), extend bagging by adding random feature selection at each split, further decorrelating base trees for even greater variance reduction. They are among the most accurate and robust off-the-shelf classifiers available.

Algorithm

  1. For b = 1 to B:
     a. Draw bootstrap sample D_b from D
     b. Grow unpruned decision tree T_b on D_b, but at each node:
        • Select m features randomly from the full p features (typically m = sqrt(p) for classification)
        • Find the best split among the m selected features only
  2. Output ensemble: f(x) = (1/B) * sum(T_b(x)) for regression, or the majority vote of the T_b(x) for classification
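
The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names (fit_forest, grow_tree, best_split and so on) and the toy data are hypothetical, class labels are assumed to be non-tuple values such as integers, and a node simply becomes a leaf when none of its m candidate features yields a valid split.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, feature_ids):
    """Best (impurity, feature, threshold) among the candidate features only."""
    best = None
    for f in feature_ids:
        for thr in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= thr]
            right = [y[i] for i, row in enumerate(X) if row[f] > thr]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, thr)
    return best

def grow_tree(X, y, m, rng):
    """Grow an unpruned tree; every node draws a fresh random subset of m features."""
    if len(set(y)) == 1:
        return y[0]  # pure leaf
    split = best_split(X, y, rng.sample(range(len(X[0])), m))
    if split is None:  # simplification: stop if the m candidates cannot split
        return Counter(y).most_common(1)[0][0]
    _, f, thr = split
    left = [i for i, row in enumerate(X) if row[f] <= thr]
    right = [i for i, row in enumerate(X) if row[f] > thr]
    return (f, thr,
            grow_tree([X[i] for i in left], [y[i] for i in left], m, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], m, rng))

def predict_tree(tree, x):
    """Walk from the root to a leaf; internal nodes are tuples, leaves are labels."""
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if x[f] <= thr else right
    return tree

def fit_forest(X, y, B=25, seed=0):
    """Train B trees, each on its own bootstrap sample, with m = sqrt(p)."""
    rng = random.Random(seed)
    m = max(1, int(len(X[0]) ** 0.5))
    forest = []
    for _ in range(B):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap D_b
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], m, rng))
    return forest

def predict_forest(forest, x):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, x) for t in forest).most_common(1)[0][0]

# Toy data (hypothetical): two well-separated classes with p = 4 features.
X = [[i, i + 1, i + 2, i + 3] for i in range(5)] + \
    [[i + 10, i + 11, i + 12, i + 13] for i in range(5)]
y = [0] * 5 + [1] * 5
forest = fit_forest(X, y, B=25, seed=0)
print(predict_forest(forest, [0, 1, 2, 3]), predict_forest(forest, [14, 15, 16, 17]))
```

The only difference from plain bagging is the rng.sample call inside grow_tree: passing all p features to best_split instead would turn the sketch back into bagged trees.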

Key Innovations Over Bagging

  • Bootstrap sampling (inherited from bagging): reduces correlation by training each tree on a different resample of the data
  • Random feature subspace (new in Random Forests): reduces correlation further by limiting the features each split may choose from
  • Combined effect: much lower inter-tree correlation than plain bagging
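
Why lower correlation matters can be seen from a standard result that is not stated in these notes: for B identically distributed trees with variance sigma^2 and pairwise correlation rho, the variance of their average is rho * sigma^2 + (1 - rho) * sigma^2 / B. The small sketch below evaluates it; the function name ensemble_variance and the chosen values are illustrative assumptions.

```python
def ensemble_variance(sigma2, rho, B):
    """Variance of the mean of B equally correlated predictors.

    sigma2: variance of a single tree, rho: pairwise correlation between trees.
    """
    return rho * sigma2 + (1.0 - rho) * sigma2 / B

# With B = 100 trees, the first term (rho * sigma2) dominates:
for rho in (0.0, 0.1, 0.5, 1.0):
    print(rho, ensemble_variance(1.0, rho, 100))
```

Adding trees only shrinks the second term, so once B is large the remaining variance is set by rho, which is what random feature selection reduces.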

Feature Importance

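Random Forests commonly compute feature importance as the mean decrease in Gini impurity: for each feature f, add up the impurity decrease at every node that splits on f, weight each decrease by the number of samples reaching that node, and average over all B trees:
Importance(f) = (1/B) * sum over trees b, sum over nodes t in T_b that split on f: n_samples_node(t) * delta_impurity(t)
The resulting scores are usually normalized so that they sum to 1.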

Higher importance = feature contributes more to pure splits on average.
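
As a rough illustration of a single term in that sum, the sketch below computes the weighted Gini decrease for one split and then normalizes raw importances. The helper names are hypothetical, and the node is weighted by the fraction of samples reaching it, which is an assumption that differs from a raw sample count only by a constant factor.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples at the node."""
    n = len(parent)
    children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return (n / n_total) * (gini(parent) - children)

def normalize(raw):
    """Scale raw per-feature importances so they sum to 1."""
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}

# A perfect split of a balanced 4-sample node in a tree trained on 8 samples.
delta = weighted_impurity_decrease([0, 0, 1, 1], [0, 0], [1, 1], n_total=8)
print(delta)
print(normalize({"feature_a": 0.3, "feature_b": 0.1}))
```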

Out-of-Bag Score

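As with bagging, the out-of-bag (OOB) score gives a generalization estimate without a separate validation set: each sample x_i is predicted using only the trees whose bootstrap sample did not contain it.
OOB_score = (1/N) * sum over i of I[mode(T_b(x_i) for b where x_i not in D_b) == y_i]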
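
A minimal sketch of the OOB computation, assuming the per-tree predictions and the bootstrap memberships are already available. The names oob_score, tree_preds and in_bag are hypothetical, and unlike the formula, samples that appear in every bootstrap sample are skipped rather than counted in N.

```python
from collections import Counter

def oob_score(y, tree_preds, in_bag):
    """OOB accuracy.

    y: true labels, length N
    tree_preds[b][i]: prediction of tree b for sample i
    in_bag[b]: set of sample indices contained in bootstrap sample D_b
    """
    correct = 0
    scored = 0
    for i, y_i in enumerate(y):
        # Only trees that never saw sample i during training may vote on it.
        votes = [preds[i] for preds, bag in zip(tree_preds, in_bag) if i not in bag]
        if not votes:
            continue  # sample i was in every bootstrap sample
        scored += 1
        if Counter(votes).most_common(1)[0][0] == y_i:
            correct += 1
    if scored == 0:
        raise ValueError("no sample has an out-of-bag prediction")
    return correct / scored

# Three samples, three trees; each sample is out-of-bag for exactly one tree.
y = [0, 1, 1]
tree_preds = [[0, 1, 0], [0, 0, 1], [1, 1, 1]]
in_bag = [{0, 1}, {1, 2}, {0, 2}]
score = oob_score(y, tree_preds, in_bag)
print(score)
```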

Hyperparameters

Hyperparameter | Typical Value | Effect
n_estimators | 100-500 | More trees = lower variance, diminishing returns
max_features | sqrt(p) for classification, p/3 for regression | Controls decorrelation
max_depth | None (full trees) | Deeper = lower bias
min_samples_leaf | 1-5 | Controls overfitting
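
The hyperparameter names in the table match scikit-learn's RandomForestClassifier, so a configuration might look like the sketch below. It assumes scikit-learn is installed; the synthetic dataset from make_classification and the specific values are illustrative choices, not recommendations from these notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): 500 samples, p = 16 features, 5 of them informative.
X, y = make_classification(
    n_samples=500, n_features=16, n_informative=5, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=200,      # number of trees B
    max_features="sqrt",   # m = sqrt(p) candidate features per split
    max_depth=None,        # grow full, unpruned trees
    min_samples_leaf=1,    # raise (e.g. to 5) to smooth noisy data
    oob_score=True,        # compute the out-of-bag accuracy during fit
    random_state=0,
)
clf.fit(X, y)

print("OOB accuracy:", clf.oob_score_)
print("Feature importances:", clf.feature_importances_)
```

Setting oob_score=True makes the fitted model expose oob_score_, and feature_importances_ holds the normalized impurity-based importances.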

Common Pitfalls

  • Random Forests can still overfit very noisy data, despite their reputation for robustness
  • Impurity-based feature importance can be biased toward high-cardinality features
  • Poor at extrapolation: tree-based models predict averages of training targets, so they cannot predict beyond the range seen in training
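
The extrapolation limit can be shown with a toy one-dimensional regression tree. The sketch uses a single fully grown tree for brevity (the function names and data are hypothetical); a forest averages such trees, so its predictions are bounded by the training targets in the same way.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared errors around the mean."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values)

def fit_tree(xs, ys):
    """Fully grown 1-D regression tree; each leaf stores the mean target."""
    if len(set(xs)) == 1:
        return mean(ys)
    pairs = sorted(zip(xs, ys))
    best = None  # (total squared error, threshold, split position)
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between equal x values
        cost = sse([t for _, t in pairs[:k]]) + sse([t for _, t in pairs[k:]])
        if best is None or cost < best[0]:
            best = (cost, (pairs[k - 1][0] + pairs[k][0]) / 2, k)
    _, thr, k = best
    left, right = pairs[:k], pairs[k:]
    return (thr,
            fit_tree([x for x, _ in left], [t for _, t in left]),
            fit_tree([x for x, _ in right], [t for _, t in right]))

def predict(tree, x):
    """Descend to a leaf; internal nodes are (threshold, left, right) tuples."""
    while isinstance(tree, tuple):
        thr, left, right = tree
        tree = left if x <= thr else right
    return tree

# Train on y = 2x for x = 0..9, then query far outside that range.
xs = list(range(10))
ys = [2.0 * x for x in xs]
tree = fit_tree(xs, ys)
print(predict(tree, 100))   # 18.0, the largest training target, not 200
print(predict(tree, -50))   # 0.0, the smallest training target
```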

Exam-Ready Summary

  • Random Forest = bagging + random feature subspace selection at each node split
  • m = sqrt(p) for classification, m = p/3 for regression (rule of thumb)
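  • OOB error is a free, approximately unbiased generalization estimate (it tends to be slightly pessimistic because each sample is scored by only a subset of the trees)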
  • Feature importance = mean decrease in Gini impurity, normalized across all trees
  • Increasing n_estimators does not by itself cause overfitting, but returns diminish beyond ~200-500 trees while training and prediction cost keep growing