Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

17. Policy Search

Lesson 21 of 22 in the free Machine Learning II notes on Siksha Sarovar, written by Rohit Jangra.

Policy search methods directly optimize the policy parameters without necessarily learning a value function. They are particularly powerful for continuous action spaces, stochastic environments, and when the policy structure itself can encode domain knowledge.

Policy Gradient Theorem

For a parameterized policy pi_theta(a|s), the gradient of the expected return is: nabla_theta J(theta) = E_pi[nabla_theta log pi_theta(a|s) * Q^pi(s,a)]

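The expectation contains no derivative of the transition dynamics, so it can be estimated from sampled trajectories. This allows gradient ascent on J even without knowing the environment model.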
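To make the formula concrete, here is a minimal sketch for the simplest possible case: a single state (a bandit) with a softmax policy, where Q does not depend on theta. The helper names (softmax, grad_log_softmax, exact_policy_gradient, numerical_gradient) and the numbers in theta and q are illustrative assumptions, not part of the notes. The sketch checks that E_pi[nabla_theta log pi_theta(a) * Q(a)] matches a finite-difference gradient of J(theta) = sum_a pi_theta(a) Q(a).

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over action preferences."""
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

def grad_log_softmax(theta, a):
    """Score function: nabla_theta log pi_theta(a) = one_hot(a) - pi_theta."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

def exact_policy_gradient(theta, q):
    """E_pi[nabla log pi(a) * Q(a)], computed by summing over all actions."""
    pi = softmax(theta)
    return sum(pi[a] * grad_log_softmax(theta, a) * q[a] for a in range(len(theta)))

def numerical_gradient(theta, q, h=1e-6):
    """Central finite differences of J(theta) = sum_a pi_theta(a) * Q(a)."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = h
        grad[i] = (softmax(theta + step) @ q - softmax(theta - step) @ q) / (2 * h)
    return grad

# Hypothetical preferences and action values for a 3-action bandit
theta = np.array([0.2, -0.5, 1.0])
q = np.array([1.0, 3.0, 2.0])
pg = exact_policy_gradient(theta, q)
fd = numerical_gradient(theta, q)
print(np.allclose(pg, fd, atol=1e-6))
```

In a full MDP the expectation is also taken over the states visited by the policy, and Q^pi itself has to be estimated, which is what the methods that follow do.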

REINFORCE (Monte Carlo Policy Gradient)

  1. Sample a complete trajectory under current policy: tau = (s_0, a_0, r_0, ..., s_T)
  2. Compute returns: G_t = sum_{k=t}^{T-1} gamma^{k-t} * r_k
  3. Update: theta = theta + alpha sum_t nabla_theta log pi_theta(a_t|s_t) G_t
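The three steps above can be sketched for a tabular softmax policy as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: theta is assumed to be an array of shape (n_states, n_actions), the trajectory is passed in as lists of states, actions and rewards, and the function names and default step sizes are hypothetical.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def reinforce_update(theta, states, actions, rewards, alpha=0.1, gamma=0.99):
    """One REINFORCE step: theta + alpha * sum_t nabla log pi(a_t|s_t) * G_t."""
    returns = discounted_returns(rewards, gamma)
    grad = np.zeros_like(theta)
    for s, a, g in zip(states, actions, returns):
        score = -softmax(theta[s])   # nabla log pi(a|s) = one_hot(a) - pi(.|s)
        score[a] += 1.0
        grad[s] += score * g
    return theta + alpha * grad

# Hypothetical 3-step episode in a 2-state, 2-action problem
theta0 = np.zeros((2, 2))
new_theta = reinforce_update(theta0, states=[0, 1, 0], actions=[1, 0, 1],
                             rewards=[1.0, 0.0, 2.0], alpha=0.1, gamma=0.5)
print(discounted_returns([1.0, 0.0, 2.0], 0.5))
print(new_theta)
```

Computing the returns backwards avoids recomputing the discounted sum for every time step. Actions followed by larger returns have their preferences raised more.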

Problem: High variance due to Monte Carlo returns. Subtracting a baseline b(s) that depends only on the state, not on the action, reduces variance without introducing bias: nabla_theta J(theta) = E[nabla_theta log pi_theta(a_t|s_t) * (G_t - b(s_t))]
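The reason a state-only baseline leaves the gradient unbiased can be shown in a few lines. The derivation below assumes a discrete action set (for continuous actions the sum becomes an integral); it shows that the extra term contributed by b(s) has zero expectation because the action probabilities always sum to 1.

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big]
&= b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) \\
&= b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) \\
&= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) \\
&= b(s)\, \nabla_\theta 1 = 0
\end{aligned}
```

The argument breaks down if the baseline depends on the action a, since b could then no longer be pulled outside the sum.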

Actor-Critic Methods

Use a learned critic (value function) in place of full Monte Carlo returns to reduce variance, at the cost of a small bias from bootstrapping:

  • Actor: Policy pi_theta(a|s), updated using the policy gradient
  • Critic: Value function V_w(s), updated from the TD error; its estimate serves as the baseline

TD error (advantage estimate): delta_t = r_t + gamma * V_w(s_{t+1}) - V_w(s_t)
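A one-step actor-critic update built on this TD error might look like the sketch below. It assumes a tabular setting (theta of shape (n_states, n_actions), v of shape (n_states,)) with a softmax actor; the function name, the learning rates, and the done flag that zeroes the bootstrap term at terminal states are assumptions added for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def actor_critic_step(theta, v, s, a, r, s_next, done,
                      alpha_actor=0.1, alpha_critic=0.5, gamma=0.99):
    """One-step actor-critic update; returns new theta, new v and the TD error."""
    # Assumption: no bootstrapping from a terminal next state
    target = r + (0.0 if done else gamma * v[s_next])
    delta = target - v[s]                    # TD error, used as the advantage estimate
    theta, v = theta.copy(), v.copy()
    v[s] += alpha_critic * delta             # critic: move V(s) toward the TD target
    score = -softmax(theta[s])               # nabla log pi(a|s) = one_hot(a) - pi(.|s)
    score[a] += 1.0
    theta[s] += alpha_actor * delta * score  # actor: policy gradient step weighted by delta
    return theta, v, delta

# Hypothetical single transition in a 2-state, 2-action problem
theta1, v1, delta = actor_critic_step(np.zeros((2, 2)), np.array([0.0, 1.0]),
                                      s=0, a=1, r=1.0, s_next=1, done=False, gamma=0.9)
print(delta, v1, theta1)
```

Unlike REINFORCE, this update can be applied after every transition, because the critic supplies the estimate of future reward instead of waiting for the episode to finish.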

Proximal Policy Optimization (PPO)

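PPO (Schulman et al., 2017) clips the policy update ratio to prevent destructively large updates: L_CLIP(theta) = E[min(r_t(theta) A_t, clip(r_t(theta), 1-eps, 1+eps) A_t)]

Where r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) is the probability ratio between the new and old policies, A_t is the advantage estimate, and eps is a small clipping hyperparameter.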
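The clipped objective can be evaluated on a batch with a few lines of NumPy. This is a sketch of the objective only, not a full PPO training loop: the function name and its arguments (log-probabilities under the new and old policies, plus advantage estimates) are illustrative, and eps=0.2 is a commonly used value rather than something fixed by the notes.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch."""
    ratio = np.exp(logp_new - logp_old)   # r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))

# Hypothetical batch: ratios 1.5 and 0.5, each with a positive and a negative advantage
ratios = np.array([1.5, 0.5, 1.5, 0.5])
advantages = np.array([1.0, 1.0, -1.0, -1.0])
value = ppo_clip_objective(np.log(ratios), np.zeros(4), advantages)
print(value)
```

Taking the minimum makes the objective pessimistic. With a positive advantage, the gain stops growing once the ratio exceeds 1+eps; with a negative advantage, it stops growing once the ratio drops below 1-eps. In both cases, moves in the harmful direction are still penalised in full.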

Method Comparison

Method        Variance   Bias  Sample Efficiency  Stability
REINFORCE     Very high  Low   Low                Poor
Actor-Critic  Medium     Low   Medium             Medium
PPO           Low        Low   High               High

Exam-Ready Summary

  • Policy gradient theorem: gradient = E[nabla log pi * Q]; no model needed
  • REINFORCE: Monte Carlo; high variance due to full trajectory returns
  • Baseline subtraction: reduces variance without bias (V(s) is common baseline)
  • Actor-Critic: critic estimates value function as variance-reducing baseline
  • PPO: clips the probability ratio so each update stays close to the old policy; a widely used, stable policy gradient method