Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

18. Adaptive Dynamic Programming

Lesson 22 of 22 in the free Machine Learning II notes on Siksha Sarovar, written by Rohit Jangra.

Adaptive Dynamic Programming (ADP) bridges dynamic programming (which requires a model) and model-free RL (which learns from experience). It encompasses classical techniques like value iteration and policy iteration as well as temporal-difference methods.

Value Iteration

Given the MDP model (P, R, gamma), value iteration computes the optimal value function: V_{k+1}(s) = max_a [R(s,a) + gamma sum_s' P(s'|s,a) V_k(s')]
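The update above can be sketched in plain Python. This is a minimal illustration, assuming a hypothetical two-state, two-action deterministic MDP; the tables P and R, the tolerance and the function name value_iteration are assumptions made for the example and are not part of these notes.

```python
def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Repeat the Bellman optimality backup until the values stop changing.

    P[s][a] is a list of (probability, next_state) pairs.
    R[s][a] is the expected immediate reward for taking a in s.
    """
    n_states = len(P)
    V = [0.0] * n_states
    for _ in range(max_iters):
        V_new = [
            max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            )
            for s in range(n_states)
        ]
        delta = max(abs(new - old) for new, old in zip(V_new, V))
        V = V_new
        if delta < tol:  # successive iterates agree to within tol
            break
    return V


# Hypothetical toy MDP: action 0 stays put, action 1 switches state.
# Staying in state 1 pays reward 1; every other transition pays 0.
P = [
    [[(1.0, 0)], [(1.0, 1)]],  # transitions from state 0
    [[(1.0, 1)], [(1.0, 0)]],  # transitions from state 1
]
R = [[0.0, 0.0], [1.0, 0.0]]

V_star = value_iteration(P, R, gamma=0.9)
print([round(v, 4) for v in V_star])
```

For this toy problem the answer can be checked by hand: staying in state 1 forever gives V*(1) = 1 / (1 - 0.9) = 10, and moving there from state 0 gives V*(0) = 0.9 * 10 = 9.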

Convergence guarantee: V_k converges to V* as k -> infinity. The Bellman optimality backup is a gamma-contraction in the max norm (contraction mapping theorem), so ||V_{k+1} - V*|| <= gamma ||V_k - V*||
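Applying the contraction bound k times gives ||V_k - V*|| <= gamma^k ||V_0 - V*||, which yields a rough iteration count for a target accuracy. The helper below is only a sketch of that arithmetic; the name iterations_needed and the example numbers are assumptions.

```python
import math


def iterations_needed(gamma, initial_error, eps):
    """Smallest k such that gamma**k * initial_error <= eps (worst-case bound)."""
    if initial_error <= eps:
        return 0
    return math.ceil(math.log(eps / initial_error) / math.log(gamma))


# Example: gamma = 0.9, initial max-norm error 10, target error 0.01.
k = iterations_needed(gamma=0.9, initial_error=10.0, eps=0.01)
print(k)
```

Because the bound is geometric, the count grows roughly like 1 / (1 - gamma) as gamma approaches 1, which is why discount factors close to 1 make value iteration slow.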

Policy Iteration

Alternates between policy evaluation and policy improvement:

  1. Policy evaluation: Compute V^pi by solving: V^pi = R^pi + gamma P^pi V^pi
  2. Policy improvement: for every state s, pi'(s) = argmax_a [R(s,a) + gamma sum_s' P(s'|s,a) V^pi(s')]
  3. Repeat until the policy stops changing (guaranteed after finitely many iterations for a finite MDP)
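The three steps above can be sketched as follows. Note one deliberate substitution: instead of solving the linear system for V^pi directly, this sketch approximates it by iterating the Bellman expectation backup, which keeps the example free of a linear-algebra library. The toy MDP, the function names and the max_rounds guard are assumptions made for illustration.

```python
def evaluate_policy(policy, P, R, gamma, tol=1e-10):
    """Approximate V^pi by iterating V <- R^pi + gamma P^pi V (stand-in for a direct solve)."""
    V = [0.0] * len(P)
    while True:
        V_new = [
            R[s][policy[s]] + gamma * sum(p * V[s2] for p, s2 in P[s][policy[s]])
            for s in range(len(P))
        ]
        if max(abs(new - old) for new, old in zip(V_new, V)) < tol:
            return V_new
        V = V_new


def improve_policy(V, P, R, gamma):
    """Greedy policy with respect to V (ties go to the lowest action index)."""
    return [
        max(
            range(len(P[s])),
            key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]),
        )
        for s in range(len(P))
    ]


def policy_iteration(P, R, gamma, max_rounds=100):
    policy = [0] * len(P)  # arbitrary initial policy: always take action 0
    V = evaluate_policy(policy, P, R, gamma)
    for _ in range(max_rounds):
        new_policy = improve_policy(V, P, R, gamma)
        if new_policy == policy:  # policy is stable, so it is greedy w.r.t. its own values
            break
        policy = new_policy
        V = evaluate_policy(policy, P, R, gamma)
    return policy, V


# Hypothetical toy MDP: action 0 stays put, action 1 switches state.
# Staying in state 1 pays reward 1; every other transition pays 0.
P = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 1)], [(1.0, 0)]],
]
R = [[0.0, 0.0], [1.0, 0.0]]

policy, V = policy_iteration(P, R, gamma=0.9)
print(policy, [round(v, 4) for v in V])
```

On this example the policy changes once, from "always stay" to "move to state 1, then stay", and the next improvement step leaves it unchanged.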

Temporal Difference (TD) Learning

TD methods learn from experience without a model. After each observed transition (s_t, r_t, s_{t+1}), the estimate is updated as: V(s_t) <- V(s_t) + alpha [r_t + gamma V(s_{t+1}) - V(s_t)]

The TD error delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) is the key learning signal.
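A minimal sketch of TD(0) policy evaluation, assuming a hypothetical two-state environment and a uniformly random policy. The step function, the constant step size and the number of steps are illustrative assumptions. The learner only ever sees sampled transitions, never P or R.

```python
import random


def step(state, action):
    """Hypothetical environment: action 0 stays, action 1 switches state.

    Returns (reward, next_state). Only staying in state 1 pays reward 1.
    """
    if state == 0:
        return (0.0, 0) if action == 0 else (0.0, 1)
    return (1.0, 1) if action == 0 else (0.0, 0)


def td0_evaluate(n_steps=200_000, alpha=0.01, gamma=0.9, seed=0):
    """Estimate V^pi for the uniformly random policy from sampled transitions."""
    rng = random.Random(seed)
    V = [0.0, 0.0]
    s = 0
    for _ in range(n_steps):
        a = rng.randrange(2)                  # policy being evaluated: uniform random
        r, s_next = step(s, a)
        delta = r + gamma * V[s_next] - V[s]  # TD error
        V[s] += alpha * delta                 # move V(s) toward the bootstrapped target
        s = s_next
    return V


V = td0_evaluate()
print([round(v, 2) for v in V])
```

For this policy the Bellman equations can be solved by hand, giving V^pi(0) = 2.25 and V^pi(1) = 2.75. With a constant step size the estimates hover near these values rather than settling exactly; a decaying step size would be needed for convergence.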

Q-Learning (Off-Policy TD)

Q(s_t, a_t) <- Q(s_t, a_t) + alpha [r_t + gamma max_a' Q(s_{t+1}, a') - Q(s_t, a_t)]

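In the tabular setting, Q-learning converges to the optimal Q* provided that every (s, a) pair is visited infinitely often and the learning rate satisfies the Robbins-Monro conditions (sum_t alpha_t = infinity, sum_t alpha_t^2 < infinity).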
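A minimal tabular Q-learning sketch, assuming a hypothetical two-state environment and a uniformly random behavior policy so that every (s, a) pair keeps being visited. The step function and hyperparameters are illustrative assumptions. A constant step size is used for simplicity; it does not satisfy the Robbins-Monro conditions and is adequate here only because the toy environment is deterministic.

```python
import random


def step(state, action):
    """Hypothetical environment: action 0 stays, action 1 switches state.

    Returns (reward, next_state). Only staying in state 1 pays reward 1.
    """
    if state == 0:
        return (0.0, 0) if action == 0 else (0.0, 1)
    return (1.0, 1) if action == 0 else (0.0, 0)


def q_learning(n_steps=50_000, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]  # Q[state][action]
    s = 0
    for _ in range(n_steps):
        a = rng.randrange(2)                    # behavior policy: uniform random
        r, s_next = step(s, a)
        target = r + gamma * max(Q[s_next])     # greedy (target policy) bootstrap
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
    return Q


Q = q_learning()
greedy = [max(range(2), key=lambda a: Q[s][a]) for s in range(2)]
print([[round(q, 3) for q in row] for row in Q], greedy)
```

The behavior policy never acts greedily, yet the max in the target makes the table approach Q* (by hand: Q*(0,0) = 8.1, Q*(0,1) = 9, Q*(1,0) = 10, Q*(1,1) = 8.1). This is what "off-policy" means in practice.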

Comparison

Method            | Requires Model | Convergence                        | Sample Efficiency
Value Iteration   | Yes            | Asymptotic to V* (geometric rate)  | High (model-based)
Policy Iteration  | Yes            | Exact (finite steps)               | High (model-based)
TD(0)             | No             | Approx (to V^pi)                   | Medium
Q-Learning        | No             | Approx (to Q*)                     | Lower (off-policy)

Common Pitfalls

  • Value iteration: requires the complete model (P and R), which is often unavailable in practice
  • TD methods: convergence requires careful learning rate scheduling
  • Q-learning with function approximation: may diverge (the "deadly triad" of function approximation, bootstrapping and off-policy learning)

Exam-Ready Summary

  • Value iteration: model-based, computes V* by repeatedly applying the Bellman optimality backup
  • Policy iteration: alternates policy evaluation (solve a linear system) and policy improvement (greedy)
  • TD(0): learns online and incrementally from single-step bootstrapped returns
  • Q-learning: off-policy TD, learns Q* regardless of behavior policy
  • ADP connects dynamic programming (model-based) with TD methods (model-free) via function approximation