Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

15. Elements of Reinforcement Learning

Lesson 19 of 22 in the free Machine Learning II notes on Siksha Sarovar, written by Rohit Jangra.

This lesson formalizes the key components of the RL framework: policies, value functions, and the role of the discount factor in shaping agent behavior.

Policies

A policy pi defines the agent's behavior:

  • Deterministic policy: a = pi(s) — maps state to action directly
  • Stochastic policy: a ~ pi(a|s) — probability distribution over actions
  • Optimal policy: pi* = argmax_pi V^pi(s) for all s, i.e. the policy whose value is at least as high as that of any other policy in every state
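
The difference between the two policy types can be sketched in a few lines of Python. This is a minimal illustration only: the state names, action names and probabilities below are hypothetical and not part of the lesson.

```python
import random

# Hypothetical toy problem: two states, two actions (names are illustrative).
ACTIONS = ["left", "right"]

def deterministic_policy(state):
    """a = pi(s): each state maps to exactly one action."""
    table = {"s0": "right", "s1": "left"}
    return table[state]

def stochastic_policy(state, rng):
    """a ~ pi(a|s): sample an action from a per-state distribution."""
    # Assumed probabilities, ordered as [P(left), P(right)].
    probs = {"s0": [0.2, 0.8], "s1": [0.7, 0.3]}
    return rng.choices(ACTIONS, weights=probs[state], k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [stochastic_policy("s0", rng) for _ in range(1000)]
frac_right = samples.count("right") / len(samples)

print(deterministic_policy("s0"))   # always the same action for s0
print(round(frac_right, 2))         # close to the assumed P(right) = 0.8
```

The deterministic policy returns the same action every time it sees a state, while the stochastic policy only matches its distribution on average over many samples.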

Value Functions

The state-value function under policy pi: V^pi(s) = E_pi[sum_{t=0}^inf gamma^t * r_t | s_0 = s]

The action-value (Q) function under policy pi: Q^pi(s, a) = E_pi[sum_{t=0}^inf gamma^t * r_t | s_0 = s, a_0 = a]

Relationship: V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)
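
As a small worked example, the discounted sum inside the expectation and the V-Q relationship can be computed directly. The reward sequence, policy probabilities and Q-values below are made-up numbers chosen for easy arithmetic, and the function names are illustrative.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for a finite reward sequence r_0, r_1, ..."""
    g = 0.0
    # Work backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def state_value_from_q(policy_probs, q_values):
    """V^pi(s) = sum_a pi(a|s) * Q^pi(s, a) for a single state s."""
    return sum(p * q for p, q in zip(policy_probs, q_values))

# 1 + 0.5 * 1 + 0.25 * 1 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))

# 0.25 * 4 + 0.75 * 8 = 7.0
print(state_value_from_q([0.25, 0.75], [4.0, 8.0]))
```

V^pi(s) and Q^pi(s, a) are expectations of this return over many trajectories; the first function evaluates it for one sampled trajectory only.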

Bellman Equations

The Bellman expectation equations provide recursive characterizations:

V^pi(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [r(s,a,s') + gamma * V^pi(s')]

Q^pi(s,a) = sum_s' P(s'|s,a) [r(s,a,s') + gamma sum_a' pi(a'|s') * Q^pi(s',a')]
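
One common way to use these equations is iterative policy evaluation: apply the right-hand side of the V^pi equation repeatedly until the values stop changing. The sketch below assumes a hypothetical two-state, two-action MDP with deterministic transitions and a uniform random policy; none of these numbers come from the lesson.

```python
GAMMA = 0.9

# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

# Uniform random policy: pi[s][a] is the probability of action a in state s.
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}

def q_from_v(V, s, a):
    """Q^pi(s,a) = sum_s' P(s'|s,a) [r(s,a,s') + gamma * V^pi(s')]."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def evaluate_policy(tol=1e-10):
    """Sweep the Bellman expectation update until the largest change < tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = sum(pi[s][a] * q_from_v(V, s, a) for a in P[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

V = evaluate_policy()
print({s: round(v, 3) for s, v in V.items()})
```

For this toy MDP the two Bellman equations can also be solved by hand, giving V(0) = 7.25 and V(1) = 7.75, which the iteration approaches.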

The Discount Factor Gamma

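The discount factor gamma, with 0 <= gamma <= 1, controls how much the agent values future rewards relative to immediate ones; a reward received t steps ahead is weighted by gamma^t: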

Gamma Value  | Behavior                            | When Useful
gamma = 0    | Fully myopic (only next reward)     | Short-horizon tasks
gamma = 0.9  | Balances near and far rewards       | Most episodic tasks
gamma = 0.99 | Long-horizon planning               | Complex strategic tasks
gamma = 1.0  | Equal weight to all future rewards  | Finite-horizon MDPs
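
To see how gamma changes which behavior looks better, the sketch below compares two hypothetical reward sequences: a small reward now versus a larger reward after nine steps. The sequences and the helper names are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """sum_t gamma^t * r_t over a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Hypothetical options for the agent:
QUICK = [1.0] + [0.0] * 9     # reward 1 immediately, nothing afterwards
PATIENT = [0.0] * 9 + [5.0]   # nothing for nine steps, then reward 5

def preferred(gamma):
    """Name of the sequence with the higher discounted return."""
    if discounted_return(PATIENT, gamma) > discounted_return(QUICK, gamma):
        return "patient"
    return "quick"

for gamma in (0.0, 0.5, 0.9, 0.99, 1.0):
    print(gamma, preferred(gamma))
```

With gamma = 0.5 the delayed reward is worth only 5 * 0.5^9, about 0.01, so the immediate reward wins; with gamma = 0.9 it is worth about 1.94, so waiting becomes the better choice.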

Exploration vs Exploitation

  • Exploitation: Take the best known action (greedy)
  • Exploration: Try new actions to discover potentially better strategies

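epsilon-greedy strategy: With probability epsilon, take a uniformly random action; otherwise (with probability 1 - epsilon) take the greedy action, i.e. the one with the highest current value estimate.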
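
A minimal sketch of epsilon-greedy action selection, assuming the agent keeps its value estimates in a plain list indexed by action. The function name and the example estimates are illustrative.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Return an action index chosen epsilon-greedily from value estimates."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimate (first one on ties).
    return max(range(len(q_values)), key=q_values.__getitem__)

rng = random.Random(42)   # fixed seed so the run is repeatable
q = [0.1, 0.7, 0.3]       # hypothetical estimates; action 1 is greedy
picks = [epsilon_greedy(q, 0.1, rng) for _ in range(10000)]
greedy_share = picks.count(1) / len(picks)
print(round(greedy_share, 2))
```

Because the random branch can also land on the greedy action, the greedy action is chosen with probability 1 - epsilon + epsilon / n for n actions, here about 0.93.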

Common Pitfalls

  • Discount factor too small: agent ignores long-term consequences and behaves myopically
  • No exploration: agent gets stuck in a locally optimal policy (exploitation trap)
  • Poorly designed reward shaping: the agent maximizes the shaped reward in ways the designer did not intend (reward hacking)
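
The exploitation trap can be reproduced on a tiny example. The sketch below assumes a hypothetical two-armed bandit with fixed payoffs, value estimates initialized to zero, and ties broken toward the lowest index; these choices are illustrative, not prescribed by the lesson.

```python
import random

REWARDS = [1.0, 2.0]  # hypothetical fixed payoffs; arm 1 is the better arm

def run_bandit(epsilon, steps=1000, seed=0):
    """Run epsilon-greedy on the toy bandit; return (estimates, pull counts)."""
    rng = random.Random(seed)
    q = [0.0, 0.0]      # estimated value of each arm
    counts = [0, 0]     # number of pulls per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(2)     # explore
        else:
            arm = q.index(max(q))      # exploit; ties go to the lowest index
        counts[arm] += 1
        # Incremental running mean of the observed rewards for this arm.
        q[arm] += (REWARDS[arm] - q[arm]) / counts[arm]
    return q, counts

greedy_q, greedy_counts = run_bandit(epsilon=0.0)
explore_q, explore_counts = run_bandit(epsilon=0.1)
print("greedy :", greedy_q, greedy_counts)
print("explore:", explore_q, explore_counts)
```

With epsilon = 0 the agent pulls arm 0 first, sees a positive reward, and never tries arm 1, so its estimate for the better arm stays at zero. A small epsilon is enough for it to discover arm 1 and switch.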

Exam-Ready Summary

  • Policy pi(a|s): defines action selection; can be deterministic or stochastic
  • Value function V^pi(s): expected discounted return from state s under policy pi
  • Q-function Q^pi(s,a): expected return from (state, action) pair, then following pi
  • Bellman equation: recursive characterization linking value at s to values at successor states
  • Discount factor gamma near 1 encourages planning; near 0 encourages myopic behavior