Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

Unit IV Overview: Reinforcement Learning

Lesson 18 of 22 in the free Machine Learning II notes on Siksha Sarovar, written by Rohit Jangra.

Reinforcement Learning (RL) addresses the problem of learning to act in an environment so as to maximize cumulative reward. Unlike supervised learning, there are no labeled examples: the agent must discover good actions through trial and error, guided only by a reward signal. RL has achieved superhuman performance in games such as Go (AlphaGo) and Dota 2 (OpenAI Five), and it is also applied to robot control.

The RL Framework

An RL agent interacts with an environment in a loop:

  1. Agent observes the current state s_t
  2. Agent selects action a_t according to its policy pi(a | s_t)
  3. Environment transitions to a new state s_{t+1}
  4. Agent receives reward r_t
  5. Goal: maximize the cumulative discounted reward G_t = sum_{k=0}^{inf} gamma^k * r_{t+k}
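The loop above can be sketched in a few lines of Python. This is only an illustrative sketch: the five-state corridor environment, the `step` function, and both policies are hypothetical and were invented for this example. The `discounted_return` function computes G_0 for a finite episode.

```python
import random

def step(state, action):
    """Hypothetical corridor with states 0..4; action is -1 (left) or +1 (right).
    Reaching state 4 yields reward 1 and ends the episode."""
    next_state = min(max(state + action, 0), 4)
    done = next_state == 4
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def random_policy(state, rng):
    """Pick left or right with equal probability (pure trial and error)."""
    return rng.choice([-1, +1])

def discounted_return(rewards, gamma):
    """G_0 = sum_k gamma^k * r_k, accumulated backwards over a finite episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def run_episode(policy, gamma=0.9, max_steps=100, seed=0):
    rng = random.Random(seed)
    state, rewards = 0, []
    for _ in range(max_steps):
        action = policy(state, rng)                 # step 2: select a_t
        state, reward, done = step(state, action)   # steps 3-4: s_{t+1}, r_t
        rewards.append(reward)
        if done:
            break
    return rewards, discounted_return(rewards, gamma)

if __name__ == "__main__":
    rewards, g = run_episode(random_policy)
    print(len(rewards), g)
```

A policy that always moves right reaches the goal in four steps, so its return is gamma^3 (the single reward of 1 arrives three steps after the first).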

Markov Decision Process (MDP)

The RL problem is formalized as an MDP: (S, A, P, R, gamma) where:

  • S: state space
  • A: action space
  • P(s' | s, a): transition probabilities
  • R(s, a, s'): reward function
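  • gamma in [0, 1]: discount factor (gamma = 1 is safe only for episodic tasks that terminate, since otherwise the return can grow without bound)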
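As a rough illustration, the five components of an MDP can be stored in a small data structure. The `MDP` class, its method names, and the two-state "cool/hot" example are all hypothetical choices made for this sketch, not part of the original notes.

```python
import random
from dataclasses import dataclass

@dataclass
class MDP:
    states: list
    actions: list
    P: dict       # P[(s, a)] -> {s_next: probability}
    R: dict       # R[(s, a, s_next)] -> reward
    gamma: float

    def validate(self):
        """Check that every P(. | s, a) is a probability distribution."""
        for s in self.states:
            for a in self.actions:
                total = sum(self.P[(s, a)].values())
                if abs(total - 1.0) > 1e-9:
                    raise ValueError(f"P(.|{s},{a}) sums to {total}")
        return True

    def sample(self, s, a, rng):
        """Draw s' ~ P(. | s, a) and return (s', R(s, a, s'))."""
        next_states = list(self.P[(s, a)].keys())
        probs = list(self.P[(s, a)].values())
        s_next = rng.choices(next_states, weights=probs, k=1)[0]
        return s_next, self.R[(s, a, s_next)]

# Hypothetical toy problem: an engine that is either cool or hot.
mdp = MDP(
    states=["cool", "hot"],
    actions=["slow", "fast"],
    P={
        ("cool", "slow"): {"cool": 1.0},
        ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
        ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
        ("hot", "fast"): {"hot": 1.0},
    },
    R={
        ("cool", "slow", "cool"): 1.0,
        ("cool", "fast", "cool"): 2.0,
        ("cool", "fast", "hot"): 2.0,
        ("hot", "slow", "cool"): 1.0,
        ("hot", "slow", "hot"): 1.0,
        ("hot", "fast", "hot"): -10.0,
    },
    gamma=0.9,
)

if __name__ == "__main__":
    mdp.validate()
    print(mdp.sample("cool", "fast", random.Random(0)))
```

Dictionaries keep the sketch readable for small problems; for large state spaces the same information is usually stored in arrays.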

Unit IV Roadmap

Topic          | Method                       | Key Idea
Elements of RL | MDP, policy, value functions | Problem formulation
Generalization | DQN, function approximation  | Scale to large state spaces
Policy Search  | REINFORCE, Actor-Critic, PPO | Direct policy optimization
ADP            | Value iteration, TD learning | Value estimation (value iteration needs a model; TD learning is model-free)

Key Concepts

  • Policy pi(a|s): Probability of taking action a in state s
  • Value function V^pi(s): Expected return from state s under policy pi
  • Q-function Q^pi(s,a): Expected return from state s, taking action a, then following pi
  • Bellman equation: V^pi(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma * V^pi(s')]
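The Bellman equation can be turned into an algorithm by applying its right-hand side repeatedly until the values stop changing (iterative policy evaluation). The sketch below assumes a known model stored in dictionaries; the function name, the data layout, and the two-state example are hypothetical.

```python
def policy_evaluation(states, actions, P, R, pi, gamma, tol=1e-10, max_iters=10_000):
    """Estimate V^pi by iterating
    V(s) <- sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma * V(s')].

    P[(s, a)]       -> {s_next: probability}
    R[(s, a, s_next)] -> reward
    pi[s][a]        -> probability of action a in state s
    """
    V = {s: 0.0 for s in states}
    for _ in range(max_iters):
        new_V = {}
        delta = 0.0
        for s in states:
            v = 0.0
            for a in actions:
                for s_next, p in P[(s, a)].items():
                    v += pi[s][a] * p * (R[(s, a, s_next)] + gamma * V[s_next])
            new_V[s] = v
            delta = max(delta, abs(v - V[s]))
        V = new_V
        if delta < tol:   # values have (numerically) converged
            break
    return V

# Hypothetical example: "stay" earns reward 1, "move" earns 0 and switches state.
states = ["A", "B"]
actions = ["stay", "move"]
P = {
    ("A", "stay"): {"A": 1.0}, ("A", "move"): {"B": 1.0},
    ("B", "stay"): {"B": 1.0}, ("B", "move"): {"A": 1.0},
}
R = {
    ("A", "stay", "A"): 1.0, ("A", "move", "B"): 0.0,
    ("B", "stay", "B"): 1.0, ("B", "move", "A"): 0.0,
}
uniform = {s: {"stay": 0.5, "move": 0.5} for s in states}
always_stay = {s: {"stay": 1.0, "move": 0.0} for s in states}

if __name__ == "__main__":
    print(policy_evaluation(states, actions, P, R, uniform, gamma=0.9))
    print(policy_evaluation(states, actions, P, R, always_stay, gamma=0.9))
```

For this example the values can be checked by hand. Under the uniform policy, symmetry gives V = 0.5 + 0.9 V, so V = 5 in both states. Under the always-stay policy, V = 1 / (1 - 0.9) = 10.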

Exam-Ready Summary

  • RL: agent learns policy by maximizing cumulative reward through environment interaction
  • MDP: mathematical framework for sequential decision making (state, action, transition, reward, discount)
  • Key distinction: model-based RL (knows or learns the transition model P and reward function R) vs model-free RL (learns values or a policy directly from experience)
  • Discount factor gamma: near 0 = myopic agent; near 1 = far-sighted agent
  • Exploration vs exploitation is the fundamental trade-off in all RL algorithms
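The effect of the discount factor noted in the summary can be checked numerically. The sketch below compares two hypothetical reward sequences, one paying a small reward immediately and one paying a larger reward four steps later; the sequences and function names are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """G_0 = sum_k gamma^k * r_k for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical choices facing the agent.
small_now = [1.0, 0.0, 0.0, 0.0, 0.0]     # reward 1 right away
large_later = [0.0, 0.0, 0.0, 0.0, 5.0]   # reward 5 after four steps

def preferred(gamma):
    """Return the name of the sequence with the higher discounted return."""
    g_now = discounted_return(small_now, gamma)
    g_later = discounted_return(large_later, gamma)
    return "small_now" if g_now >= g_later else "large_later"

if __name__ == "__main__":
    for gamma in (0.1, 0.5, 0.9, 0.99):
        print(gamma, preferred(gamma))
```

With gamma = 0.1 the delayed reward is worth only 5 * 0.1^4 = 0.0005, so the myopic agent prefers the immediate reward. With gamma = 0.99 it is worth about 4.8, so the far-sighted agent waits.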