Sahithyan's S3 — Artificial Intelligence

Reinforcement Learning

The process of learning to maximize expected rewards by trial and error. Feedback is given in the form of rewards. The agent's utility is defined by the reward function, and all learning is based on observed samples of outcomes.

Builds on MDPs, but here the transition function and the reward function are unknown. The world cannot be simulated in advance because there is no known model, so learning happens online.

Defined by:

  • States S
  • Actions A(s)
  • Policy π(s)

Model-based learning learns from a model of the environment: through observations over multiple simulations, the T and R functions are learnt, and the resulting MDP can then be solved. Used when the environment is known or can be modelled.
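As a rough sketch, T and R could be estimated by counting observed transitions. The experience tuples, action names and the choice to average rewards per (s, a) rather than per (s, a, s') below are illustrative assumptions, not part of the notes.

from collections import defaultdict

# Hypothetical logged experience: (state, action, next_state, reward) tuples
experience = [
    ('A', 'right', 'B', 2), ('B', 'right', 'C', 1),
    ('A', 'right', 'B', 2), ('B', 'right', 'A', 1),
]

counts = defaultdict(lambda: defaultdict(int))   # visit counts per (s, a) -> s'
reward_sum = defaultdict(float)                  # accumulated reward per (s, a)
total = defaultdict(int)                         # number of samples per (s, a)

for s, a, s2, r in experience:
    counts[(s, a)][s2] += 1
    reward_sum[(s, a)] += r
    total[(s, a)] += 1

# Estimated transition model T(s, a, s') as relative frequencies
T = {sa: {s2: n / total[sa] for s2, n in outcomes.items()}
     for sa, outcomes in counts.items()}
# Estimated reward model R(s, a) as the average observed reward
R = {sa: reward_sum[sa] / total[sa] for sa in total}

print(T[('B', 'right')])   # {'C': 0.5, 'A': 0.5}
print(R[('A', 'right')])   # 2.0

The estimated T and R can then be plugged into a known MDP-solving method (e.g. value iteration) to obtain a policy.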

Model-free learning learns values (or Q-values) directly from interactions with the environment, without estimating T or R. Used when the environment is too complex or cannot be modelled. Can use either sample-based policy evaluation or control methods.

Sample-based policy evaluation is a method for estimating the value of a given policy π using averages obtained from samples. It is used in some model-free learning algorithms.

Over multiple episodes, following π(s), the following estimate is computed.

V_{\pi}(s) \approx \dfrac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s)

Here:

  • G_i(s) is the discounted sum of rewards observed from the i-th visit to s onwards
  • N(s) is the number of samples (visits) collected for s
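As a hypothetical example: if three sampled episodes pass through s with discounted returns G_1(s) = 4, G_2(s) = 6 and G_3(s) = 8, then N(s) = 3 and V_π(s) ≈ (4 + 6 + 8) / 3 = 6.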

Active RL learns the best policy by interacting with the environment: the agent chooses its own actions rather than following a fixed policy.

A challenge for active RL: the agent must balance:

  • Explore: try new actions to discover their rewards.
  • Exploit: use the best-known action for a high return.
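One common way to balance the two is ε-greedy action selection: with probability ε the agent explores, otherwise it exploits. The sketch below is a generic illustration; the Q-table layout and the default ε value are assumptions, not part of the notes.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # Explore: with probability epsilon, pick a random action
    if random.random() < epsilon:
        return random.choice(actions)
    # Exploit: otherwise pick the best-known action for this state
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

A small ε means the agent mostly exploits while still occasionally exploring; ε is often decayed over time.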

Passive RL follows the given policy π(s) and learns the values of states under π(s).

All of the algorithms below are sample-based: Direct Evaluation and TD Learning evaluate a fixed policy from samples, while Q-Learning performs sample-based Q-value iteration.

Direct Evaluation. Model-free. Passive. Learns the value of each state V(s) by averaging the returns observed after visiting s under a fixed policy π(s). Simple to implement.

Must wait until the episode ends to compute returns. Each state must be learnt separately. Slow.

import random

# Define a simple environment
states = ['A', 'B', 'C', 'D', 'E']
rewards = {'A': 2, 'B': 1, 'C': 5, 'D': 0, 'E': 10}
transitions = {
    'A': 'B',
    'B': 'C',
    'C': 'D',
    'D': 'E',
    'E': None  # terminal
}

# Parameters
gamma = 1.0  # no discount
num_episodes = 1000

# Value table and returns
V = {s: 0.0 for s in states}
returns = {s: [] for s in states}

for _ in range(num_episodes):
    # Start at a random state
    state = random.choice(states)
    episode = []

    # Generate an episode following the fixed policy (always move to the next state)
    while state is not None:
        reward = rewards[state]
        episode.append((state, reward))
        state = transitions[state]

    # Calculate returns from each state in the episode
    G = 0
    for t in reversed(range(len(episode))):
        s, r = episode[t]
        G = r + gamma * G
        returns[s].append(G)

# Compute average return for each state
for s in states:
    if returns[s]:
        V[s] = sum(returns[s]) / len(returns[s])

print("Estimated State Values (V):")
for s, v in V.items():
    print(f"{s}: {v:.2f}")

Temporal difference learning (aka TD Learning). Model-free. Passive. A smarter version of Direct Evaluation: value estimates V(s) are updated after each step, without waiting for the episode to end. Combines ideas from Monte Carlo methods (learning from experience) and dynamic programming (bootstrapping on the next state's value). Incremental.

Faster than Direct Evaluation.

The sample value of s under π is calculated as:

V^s_\pi(s) = R(s,\pi(s),s') + \gamma V_\pi(s')

The update rule is:

V_\pi(s) \leftarrow (1-\alpha)\,V_\pi(s) + \alpha\, V^s_\pi(s)

Or:

V_\pi(s) \leftarrow V_\pi(s) + \alpha \Big[ R(s,\pi(s),s') + \gamma V_\pi(s') - V_\pi(s) \Big]

Here α is the learning rate. A larger α gives faster but noisier learning; decreasing α over time makes the running average converge.
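As a rough sketch, TD learning can be run on the same five-state chain used in the Direct Evaluation example above; the α value and episode count here are illustrative assumptions.

import random

states = ['A', 'B', 'C', 'D', 'E']
rewards = {'A': 2, 'B': 1, 'C': 5, 'D': 0, 'E': 10}
transitions = {'A': 'B', 'B': 'C', 'C': 'D', 'D': 'E', 'E': None}  # fixed policy: always move right

gamma = 1.0   # no discount
alpha = 0.1   # learning rate
num_episodes = 1000

V = {s: 0.0 for s in states}

for _ in range(num_episodes):
    state = random.choice(states)
    while state is not None:
        reward = rewards[state]
        next_state = transitions[state]
        next_value = V[next_state] if next_state is not None else 0.0
        # TD update: nudge V(s) towards the sample R + gamma * V(s')
        V[state] += alpha * (reward + gamma * next_value - V[state])
        state = next_state

print({s: round(v, 2) for s, v in V.items()})

Unlike Direct Evaluation, the estimate for a state is updated immediately after each step, using the current estimate of the next state's value.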

Q-Learning. Model-free. Active. Learns action values Q(s, a) directly in order to find the optimal policy. Performs sample-based Q-value iteration.

Requires enough exploration to converge to the optimal values. The learning rate must also be reduced over time, and the reduction must be gradual, not too quick.

The sample Q-value for state s and action a is calculated as:

Q_s(s,a) = R(s,a,s') + \gamma \max_{a'} Q(s',a')

The update rule is:

Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\, Q_s(s,a)

Or:

Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ R(s,a,s') + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big]

The optimal policy can be extracted using π*(s) = arg max_a Q(s, a).
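A minimal sketch of Q-learning with ε-greedy exploration follows; the corridor environment, reward values and hyperparameters below are illustrative assumptions, not from the notes.

import random

# Hypothetical corridor: states 0..4, goal at the right end
n_states = 5
actions = ['left', 'right']

def step(state, action):
    # Deterministic dynamics: move one cell; reaching the last state ends the episode
    next_state = max(0, state - 1) if action == 'left' else min(n_states - 1, state + 1)
    if next_state == n_states - 1:
        return next_state, 10.0, True    # goal reward
    return next_state, -1.0, False       # step cost

gamma = 0.9      # discount
alpha = 0.1      # learning rate
epsilon = 0.1    # exploration rate
num_episodes = 2000

Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

for _ in range(num_episodes):
    state = random.randrange(n_states - 1)   # start anywhere except the goal
    done = False
    while not done:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Sample-based Q-value update with the max over the next actions
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# Extract the greedy policy: pi*(s) = argmax_a Q(s, a)
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)}
print(policy)   # expected: 'right' in every non-terminal state

Note that the update uses the max over a' rather than the action actually taken next, which is why Q-learning can learn the optimal policy even while exploring.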