The process of learning to maximize expected rewards by trial and error. Feedback is given in the form of rewards. Agent’s utility is defined by the reward function. All learning is based on observed samples of outcomes.
Builds on MDPs, but here the transition function and reward function are unknown, so the agent cannot plan using a known model of the world. Learning happens online.
Defined by:
- States
- Actions
- Policy
Model-Based vs. Model-Free
Model-Based Learning
Learns a model of the environment: the transition and reward functions are estimated from observations over multiple episodes, and values are then computed from the learned model. Useful when the environment can be modelled accurately.
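The model-estimation step can be sketched as counting observed transitions and averaging observed rewards. The episode data, state names, and action names below are illustrative assumptions, not from the notes:

```python
from collections import defaultdict

# Illustrative experience: lists of (s, a, s', r) tuples.
episodes = [
    [('A', 'go', 'B', 1.0), ('B', 'go', 'C', 5.0)],
    [('A', 'go', 'B', 1.0), ('B', 'go', 'A', 0.0)],
    [('A', 'go', 'B', 1.0), ('B', 'go', 'C', 5.0)],
]

counts = defaultdict(int)        # (s, a, s') -> number of occurrences
totals = defaultdict(int)        # (s, a)     -> number of occurrences
reward_sum = defaultdict(float)  # (s, a, s') -> summed reward

for episode in episodes:
    for s, a, s2, r in episode:
        counts[(s, a, s2)] += 1
        totals[(s, a)] += 1
        reward_sum[(s, a, s2)] += r

# Empirical estimates of the transition and reward functions.
T_hat = {k: counts[k] / totals[(k[0], k[1])] for k in counts}
R_hat = {k: reward_sum[k] / counts[k] for k in counts}

print(T_hat[('B', 'go', 'C')])  # 2 of the 3 transitions from (B, go) went to C
```

Once `T_hat` and `R_hat` are built, the learned MDP can be solved with standard value iteration as if the model were exact.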
Model-Free Learning
Learns directly from interactions with the environment, without building a model. Used when the environment is too complex to model or cannot be modelled. Can use either sample-based policy evaluation or control methods.
Sample-Based Policy Evaluation
A method for estimating the value of a given policy by using averages obtained from samples. Used in some model-free learning algorithms.
Over multiple episodes, following $\pi$, the following update rule is executed for each sampled return:

$$V^\pi(s) \leftarrow V^\pi(s) + \frac{1}{N(s)}\left(G - V^\pi(s)\right)$$

Here:
- $G$ is the discounted sum of rewards from a visit to $s$
- $N(s)$ is the number of samples for $s$
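This running-average update can be sketched as follows; the helper name `update_value` is illustrative:

```python
def update_value(V, N, s, G):
    """Apply V(s) <- V(s) + (1/N(s)) * (G - V(s)) — an incremental running average."""
    N[s] += 1
    V[s] += (G - V[s]) / N[s]

# The incremental update converges to the plain average of the samples,
# without storing the full list of returns.
V = {'A': 0.0}
N = {'A': 0}
for G in [4.0, 8.0, 6.0]:
    update_value(V, N, 'A', G)

print(V['A'])  # 6.0, the mean of the three sampled returns
```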
Active vs. Passive
Active
Learns the best policy by interacting with the environment.
Exploration vs. Exploitation
A challenge for active RL. The agent must balance:
- Explore: try new actions to discover rewards.
- Exploit: use best-known action for high return.
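A common way to balance the two is an $\epsilon$-greedy rule: explore with probability $\epsilon$, exploit otherwise. A minimal sketch, where the function name and Q-table layout are assumptions:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon, explore (pick a random action);
    otherwise exploit the best-known action under Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

Q = {('s0', 'left'): 1.0, ('s0', 'right'): 3.0}
# With epsilon = 0 the agent always exploits the best-known action:
print(epsilon_greedy(Q, 's0', ['left', 'right'], epsilon=0.0))  # right
```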
Passive
Follows the given policy $\pi$. Learns the value of states $V^\pi(s)$ using sample-based policy evaluation.
Algorithms
All of the below algorithms use sample-based policy evaluation.
Direct Evaluation
Model-free. Passive. Learns the value of each state by averaging the returns observed after visiting it under a fixed policy $\pi$. Simple to implement.
Must wait until the episode ends to compute returns. Each state must be learnt separately. Slow.
```python
import random

# Define a simple environment
states = ['A', 'B', 'C', 'D', 'E']
rewards = {'A': 2, 'B': 1, 'C': 5, 'D': 0, 'E': 10}
transitions = {
    'A': 'B',
    'B': 'C',
    'C': 'D',
    'D': 'E',
    'E': None  # terminal
}

# Parameters
gamma = 1.0  # no discount
num_episodes = 1000

# Value table and returns
V = {s: 0.0 for s in states}
returns = {s: [] for s in states}

for _ in range(num_episodes):
    # Start at a random state
    state = random.choice(states)
    episode = []

    # Generate an episode following the fixed policy (always move to the next state)
    while state is not None:
        reward = rewards[state]
        episode.append((state, reward))
        state = transitions[state]

    # Calculate returns from each state in the episode
    G = 0
    for t in reversed(range(len(episode))):
        s, r = episode[t]
        G = r + gamma * G
        returns[s].append(G)

# Compute average return for each state
for s in states:
    if returns[s]:
        V[s] = sum(returns[s]) / len(returns[s])

print("Estimated State Values (V):")
for s, v in V.items():
    print(f"{s}: {v:.2f}")
```

Temporal Difference Learning
Also known as TD learning. Model-free. Passive. A smarter version of Direct Evaluation: it updates value estimates after each step instead of waiting for the episode to end. Combines ideas from Monte Carlo (learning from experience) and Dynamic Programming (bootstrapping using the next state’s value). Incremental.
Faster than Direct Evaluation.
The sample value of $s$ using the observed transition $(s, \pi(s), s')$ is calculated as:

$$\text{sample} = R(s, \pi(s), s') + \gamma V^\pi(s')$$

The update rule is:

$$V^\pi(s) \leftarrow (1 - \alpha)V^\pi(s) + \alpha \cdot \text{sample}$$

Or:

$$V^\pi(s) \leftarrow V^\pi(s) + \alpha\left(\text{sample} - V^\pi(s)\right)$$

Here $\alpha$ is the learning rate. A larger $\alpha$ provides faster but noisier learning. Decreasing $\alpha$ over time gives a convergent average.
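The TD update can be sketched on the same chain environment as the Direct Evaluation example (the environment itself is illustrative):

```python
import random

random.seed(0)

# Same illustrative chain environment as in the Direct Evaluation example.
states = ['A', 'B', 'C', 'D', 'E']
rewards = {'A': 2, 'B': 1, 'C': 5, 'D': 0, 'E': 10}
transitions = {'A': 'B', 'B': 'C', 'C': 'D', 'D': 'E', 'E': None}

gamma = 1.0   # no discount
alpha = 0.1   # learning rate
V = {s: 0.0 for s in states}

for _ in range(2000):
    state = random.choice(states)
    while state is not None:
        next_state = transitions[state]
        # sample = R(s, pi(s), s') + gamma * V(s'); the terminal "state" has value 0
        sample = rewards[state] + gamma * (V[next_state] if next_state else 0.0)
        # V(s) <- V(s) + alpha * (sample - V(s)): update after every step
        V[state] += alpha * (sample - V[state])
        state = next_state

print({s: round(v, 2) for s, v in V.items()})
```

Unlike Direct Evaluation, each update happens immediately after a single transition, so no per-state list of returns needs to be stored.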
Q-Learning
Model-free. Active. Learns action values directly to find the optimal policy. Performs sample-based Q-value iteration.
Requires enough exploration to converge to the optimal policy. The learning rate must be reduced gradually, not too quickly.
The sample Q-value of taking action $a$ at state $s$, using the observed transition $(s, a, s')$, is calculated as:

$$\text{sample} = R(s, a, s') + \gamma \max_{a'} Q(s', a')$$

The update rule is:

$$Q(s, a) \leftarrow (1 - \alpha)Q(s, a) + \alpha \cdot \text{sample}$$

Or:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left(\text{sample} - Q(s, a)\right)$$

The optimal policy can be extracted using $\pi^*(s) = \arg\max_a Q(s, a)$.
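A minimal sketch of tabular Q-learning with $\epsilon$-greedy exploration on a tiny two-state problem; all state names, action names, and parameter values here are illustrative assumptions:

```python
import random

random.seed(0)

# Illustrative problem: from 's0', 'go' reaches the terminal state with
# reward 10; 'stay' loops back to 's0' with reward 0.
actions = ['go', 'stay']
gamma = 0.9
alpha = 0.5
epsilon = 0.2

def step(state, action):
    return ('end', 10.0) if action == 'go' else ('s0', 0.0)

Q = {('s0', a): 0.0 for a in actions}

for _ in range(500):
    state = 's0'
    while state != 'end':
        # Epsilon-greedy action selection: explore vs. exploit
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward = step(state, action)
        # sample = R(s, a, s') + gamma * max_a' Q(s', a'); terminal value is 0
        future = 0.0 if next_state == 'end' else max(Q[(next_state, a)] for a in actions)
        sample = reward + gamma * future
        # Q(s, a) <- (1 - alpha) Q(s, a) + alpha * sample
        Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * sample
        state = next_state

# Extract the greedy policy at 's0'
policy = max(actions, key=lambda a: Q[('s0', a)])
print(Q, policy)
```

Note that Q-learning is off-policy: it learns the optimal Q-values even while acting $\epsilon$-greedily, because the target uses $\max_{a'} Q(s', a')$ rather than the action actually taken next.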