Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

16. Generalization in Reinforcement Learning

Lesson 20 of 22 in the free Machine Learning II notes on Siksha Sarovar, written by Rohit Jangra.

Generalization in RL is the ability to perform well on states not seen during training. Deep RL represents the Q-function and the policy with neural networks, which lets agents operate in large or continuous state spaces where a table of values is infeasible.

Deep Q-Networks (DQN)

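DeepMind's DQN (Mnih et al., 2015) reached performance comparable to a professional human games tester across a set of 49 Atari 2600 games by approximating the optimal action-value function with a neural network with parameters theta:

  Q(s, a; theta) ≈ Q*(s, a)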

Two key innovations:

  1. Experience Replay: Store transitions (s, a, r, s') in a replay buffer. Sample random mini-batches to break temporal correlations between consecutive samples.
  2. Target Network: Use a separate, periodically updated target network with parameters theta^- to compute stable TD targets:
     L(theta) = E[(r + gamma * max_a' Q(s', a'; theta^-) - Q(s, a; theta))^2]
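The two ideas above can be sketched in plain Python. This is a minimal illustration, not the original DQN implementation: the names `ReplayBuffer`, `td_targets` and `squared_td_loss` are hypothetical, the `done` flag is an added assumption (the notes store only (s, a, r, s')), and `q_online` / `q_target` are assumed to be callables that return a list of action values for a state.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions (s, a, r, s_next, done)."""

    def __init__(self, capacity, seed=None):
        self.storage = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement ignores the order of collection,
        # which breaks the correlation between consecutive transitions.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def td_targets(batch, q_target, gamma):
    """Compute r + gamma * max_a' Q(s', a'; theta^-) for each transition.

    Assumption: terminal transitions (done=True) do not bootstrap.
    """
    targets = []
    for _s, _a, r, s_next, done in batch:
        bootstrap = 0.0 if done else gamma * max(q_target(s_next))
        targets.append(r + bootstrap)
    return targets


def squared_td_loss(batch, q_online, q_target, gamma):
    """Mini-batch estimate of L(theta): the mean squared TD error."""
    targets = td_targets(batch, q_target, gamma)
    errors = [y - q_online(s)[a] for (s, a, _r, _sn, _d), y in zip(batch, targets)]
    return sum(e * e for e in errors) / len(errors)
```

In a full agent, `q_target` would be a frozen copy of `q_online` that is refreshed every fixed number of updates, so the targets stay constant while theta is being trained.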

Deadly Triad in Function Approximation

Three ingredients that together can cause divergence:

  1. Off-policy learning: Learning from data generated by a different policy
  2. Function approximation: Approximating Q with a parameterized function (e.g., a linear model or a neural network) instead of a table
  3. Bootstrapping: Updating an estimate toward a target that is itself built from current estimates (TD methods)

Experience replay and target networks do not remove any element of the triad and give no convergence guarantee, but in practice they make training with all three much more stable.
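A small numerical sketch can make the divergence concrete. It uses a textbook-style two-state construction, chosen here for illustration and not taken from these notes: state A has value estimate w, state B has estimate 2w, and only the transition A -> B with reward 0 is ever updated (off-policy), using a semi-gradient TD(0) step (bootstrapping) on a shared weight (function approximation). The function name and default values are assumptions.

```python
def semi_gradient_td_two_state(gamma, alpha=0.1, steps=100, w=1.0):
    """Repeat the semi-gradient TD(0) update on the single transition A -> B.

    v(A) = 1 * w and v(B) = 2 * w share one weight (linear function approximation).
    Each step multiplies w by (1 + alpha * (2 * gamma - 1)), so w grows without
    bound whenever gamma > 0.5.
    """
    for _ in range(steps):
        td_error = 0.0 + gamma * (2.0 * w) - w  # r + gamma * v(B) - v(A)
        w += alpha * td_error * 1.0             # gradient of v(A) w.r.t. w is 1
    return w


diverging = semi_gradient_td_two_state(gamma=0.9)   # |w| keeps growing
shrinking = semi_gradient_td_two_state(gamma=0.4)   # |w| decays toward 0
```

Even with a linear approximator the weight blows up for gamma above 0.5, which shows that the problem is the combination of the three ingredients and not the neural network alone.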

Generalization Challenges

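| Challenge               | Description                       | Solution                            |
|-------------------------|-----------------------------------|-------------------------------------|
| Catastrophic forgetting | New experiences override old ones | Experience replay                   |
| Unstable targets        | Target keeps changing             | Target network                      |
| Overestimation bias     | Max operator overestimates Q      | Double DQN                          |
| Sample inefficiency     | Needs many samples                | Model-based RL, prioritized replay  |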
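To illustrate the Double DQN entry in the table, the sketch below contrasts the standard DQN target with the Double DQN target for a single transition. The function names and the example numbers are hypothetical; `q_online_next` and `q_target_next` stand for the action values of the next state s' under the online and target networks.

```python
def dqn_target(r, q_target_next, gamma):
    """Standard DQN: the target network both selects and evaluates the action."""
    return r + gamma * max(q_target_next)


def double_dqn_target(r, q_online_next, q_target_next, gamma):
    """Double DQN: the online network selects, the target network evaluates."""
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return r + gamma * q_target_next[best_action]


# Made-up action values for one next state s'
q_online_next = [1.0, 3.0, 2.0]
q_target_next = [5.0, 0.5, 4.0]

y_dqn = dqn_target(1.0, q_target_next, gamma=0.9)
y_double = double_dqn_target(1.0, q_online_next, q_target_next, gamma=0.9)
```

Because the evaluated value is one entry of `q_target_next` and not its maximum, the Double DQN target can never exceed the DQN target for the same inputs, which is how it counteracts the upward bias of the max operator.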

Policy Gradient Methods

Instead of learning Q, directly optimize a parameterized policy pi_theta by gradient ascent on the expected return:

  J(theta) = E_pi[G_t]
  nabla_theta J(theta) = E_pi[nabla_theta log pi_theta(a_t|s_t) * G_t]

This REINFORCE estimator is unbiased but has high variance. Actor-Critic methods reduce the variance by subtracting a baseline, typically a learned value function (the critic), from G_t.
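The gradient estimator above can be tried on the simplest possible problem, a one-step bandit with a softmax policy. This is a sketch under stated assumptions: the bandit setup, the function names and the running-mean baseline are illustrative choices, and the baseline is a plain average of returns, not a learned critic.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, steps=2000, alpha=0.1, use_baseline=True, seed=0):
    """REINFORCE on a one-step bandit where arm k always pays rewards[k]."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)  # policy parameters: one preference per arm
    baseline = 0.0                # running mean of observed returns
    for t in range(1, steps + 1):
        probs = softmax(theta)
        a = rng.choices(range(len(rewards)), weights=probs)[0]
        G = rewards[a]            # the return of a one-step episode
        weight = G - baseline if use_baseline else G
        for k in range(len(theta)):
            # For a softmax policy: d/d theta_k log pi(a) = 1[k == a] - pi(k)
            grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += alpha * weight * grad_log_pi
        baseline += (G - baseline) / t
    return softmax(theta)


final_probs = reinforce_bandit(rewards=[0.0, 1.0])
```

Subtracting the baseline leaves the expected gradient unchanged because the baseline does not depend on the sampled action; it only rescales each individual update, which is the variance-reduction idea that Actor-Critic methods build on.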

Exam-Ready Summary

  • DQN: Q-learning with neural network function approximator + experience replay + target network
  • Experience replay: breaks temporal correlations, improves sample efficiency
  • Target network: reduces oscillations by holding the parameters used for Q-learning targets fixed between periodic updates
  • Deadly triad: off-policy + function approximation + bootstrapping can diverge
  • Actor-Critic: policy gradient with value function baseline to reduce variance