
Markov Decision Process

Also known as an MDP. Defines a stochastic control problem, denoted by $M = (S, A, T, R)$.

Here:

  • $S$ is the set of states
  • $A$ is the set of actions
  • $T: S \times A \times S \rightarrow [0, 1]$ is the transition function
  • $R: S \times A \times S \rightarrow \mathbb{R}$ is the reward function

The objective is to maximize the expected sum of future rewards. A policy is computed to determine how to act.

Markov means that the next state depends only on the current state and action, not on the history of previous states.

A stochastic domain is an environment in which outcomes are governed by probabilistic rules rather than being fully deterministic.

The result of an agent’s action is not always predictable; instead, each action can lead to multiple possible outcomes, each with a certain probability. This uncertainty requires agents to reason about the likelihood of different future states and to make decisions that account for risk and variability.

Stochastic domains are commonly modeled using frameworks like MDPs.

The transition function describes the probability of moving from one state to another given an action. Denoted $T(s, a, s')$; the probability is written $P(s' \mid s, a)$. Here:

  • $s$: current state
  • $a$: action taken
  • $s'$: next state
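
For example (a made-up grid-world setting), an agent that tries to move north might slip sideways, so the transition probabilities out of a state $s$ under the action north could be:

$$T(s, \text{north}, s_{\text{up}}) = 0.8, \qquad T(s, \text{north}, s_{\text{left}}) = 0.1, \qquad T(s, \text{north}, s_{\text{right}}) = 0.1$$

For each state–action pair, the probabilities over all possible next states sum to 1.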

The reward function assigns a numerical value indicating the immediate reward received by the agent when transitioning from state $s$ to state $s'$ by taking action $a$. Denoted $R(s, a, s')$. Here:

  • $s$: current state
  • $a$: action taken
  • $s'$: next state

A policy is a function that maps states to actions, denoted $\pi: S \to A$. It determines the action to be taken in a given state.

The optimal policy is the policy that maximizes the expected sum of future rewards, denoted $\pi^*(s)$.

A policy that exhaustively defines the action for every state results in a reflex agent.

Policy evaluation is the process of computing the state-value function $V_\pi$ for a given policy $\pi$:

$$V_\pi(s) = \sum_{s'} T\big(s, \pi(s), s'\big) \Big[ R\big(s, \pi(s), s'\big) + \gamma V_\pi(s') \Big]$$
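
Since this equation holds for every state, it is a system of $|S|$ linear equations in the unknowns $V_\pi(s)$. For small MDPs with $\gamma < 1$ it can therefore also be solved exactly rather than iteratively, in matrix form $V_\pi = (I - \gamma T_\pi)^{-1} R_\pi$, where $T_\pi$ is the transition matrix and $R_\pi$ the expected immediate-reward vector under $\pi$.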

It’s reasonable to prefer rewards sooner rather than later. Discounting incorporates this preference through a discount factor $\gamma \in (0,1]$ that reduces the importance of future rewards. The discounted return is:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

Here:

  • $t$ is the current time step
  • $R_{t+k+1}$ is the reward received at time step $t+k+1$
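
As a quick worked example (with made-up numbers): if $\gamma = 0.9$ and the agent receives a constant reward of $1$ at every future step, then

$$G_t = \sum_{k=0}^{\infty} 0.9^k \cdot 1 = \frac{1}{1 - 0.9} = 10$$

so even an infinite stream of rewards has a finite discounted value whenever $\gamma < 1$.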

The Bellman optimality equation states that the optimal value of a state equals the maximum expected return obtainable by choosing the best action now and then behaving optimally thereafter. It is defined recursively:

$$\begin{aligned} Q^*(s,a) &= \sum_{s'} T(s, a, s') \Big[ R(s,a,s') + \gamma V^*(s') \Big] \\ V^*(s) &= \max_a Q^*(s,a) \end{aligned}$$

Here:

  • $V^*(s)$ is the optimal value of $s$.
  • $Q^*(s,a)$ is the optimal action-value of $a$ at $s$: the maximum expected reward for taking action $a$ in state $s$ and acting optimally thereafter.

Expectimax is a decision-making algorithm that extends minimax to environments with randomness.

A max node represents the decision point where the agent chooses the action that maximizes expected utility. The agent assumes that the opponent will act optimally and that chance outcomes will occur according to their probabilities.

A min node represents the decision point where the opponent chooses the action that minimizes the agent’s expected utility. The algorithm assumes rational adversarial play.

A chance node represents uncertainty such as dice rolls or card draws. The value of a chance node is the expected value over all possible outcomes:

$$V = \sum_i P(i) \times V_i$$

Here:

  • $P(i)$: probability of outcome $i$
  • $V_i$: value resulting from that outcome

The tree is evaluated bottom-up (a code sketch follows this list):

  • Start from the terminal nodes (end states) and assign them utility values.
  • Move upward through the tree:
    • At max nodes, take the maximum of child values.
    • At min nodes, take the minimum.
    • At chance nodes, take the weighted average (expected value).
  • The root’s value represents the expected outcome of optimal play with chance considered.
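
Below is a minimal sketch of this bottom-up evaluation in Python. It assumes a hypothetical tree representation in which every node is a dict with a `type` key ('max', 'min', 'chance', or 'terminal'); terminal nodes carry a `value`, and chance nodes list their children together with probabilities:

from typing import Any, Dict

def expectimax(node: Dict[str, Any]) -> float:
    """Evaluate an expectimax tree bottom-up (hypothetical node format)."""
    kind = node["type"]
    if kind == "terminal":
        return node["value"]  # utility assigned to an end state
    if kind == "max":
        # Agent picks the child with the highest value
        return max(expectimax(child) for child in node["children"])
    if kind == "min":
        # Opponent picks the child with the lowest value for the agent
        return min(expectimax(child) for child in node["children"])
    if kind == "chance":
        # Weighted average (expected value) over possible outcomes
        return sum(p * expectimax(child) for p, child in node["children"])
    raise ValueError(f"Unknown node type: {kind}")

# Made-up example: choose between a sure payoff of 3 and a fair coin flip
# between 0 and 10. The gamble's expected value (5.0) wins.
tree = {
    "type": "max",
    "children": [
        {"type": "terminal", "value": 3.0},
        {"type": "chance", "children": [
            (0.5, {"type": "terminal", "value": 0.0}),
            (0.5, {"type": "terminal", "value": 10.0}),
        ]},
    ],
}
print(expectimax(tree))  # 5.0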

Value iteration is a fixed-point method. It computes the optimal state-value function $V^*$ by iteratively applying the Bellman optimality equation until convergence.

Initial values:

$$\begin{aligned} V_0(s) &= 0, &\forall s \in S,\\ Q_0(s,a) &= 0, &\forall s \in S,\; a \in A. \end{aligned}$$

For all $s \in S$, the updates below are performed until the maximum change across states falls below a small threshold $\theta$:

$$\begin{aligned} Q_{k+1}(s,a) &= \sum_{s'} T(s, a, s') \Big[ R(s,a,s') + \gamma V_k(s') \Big] \\ V_{k+1}(s) &= \max_a Q_{k+1}(s,a) \end{aligned}$$

Once the threshold is reached, the algorithm terminates, and $\pi^*(s)$ can be extracted from the latest iteration’s results:

$$\pi^*(s) = \arg\max_a Q^*(s,a)$$

Requirements for convergence:

  • No zero-cost loops.
  • The goal state must be reachable from every state with probability 1.

These conditions matter mainly in the undiscounted case ($\gamma = 1$); with $\gamma < 1$ and bounded rewards, the discounting itself guarantees convergence.

Analysis:

  • Time complexity: $O(S^2 A)$ per iteration.
  • Space complexity: $O(S)$.
from typing import Dict, List, Tuple, Any

def value_iteration(states: List[Any],
                    actions: Dict[Any, List[Any]],
                    P: Dict[Any, Dict[Any, List[Tuple[float, Any, float]]]],
                    gamma: float = 0.99,
                    theta: float = 1e-6):
    """
    Perform value iteration.

    Args:
        states: list of states
        actions: dict mapping state -> list of possible actions in that state
        P: transition model:
            P[s][a] = list of (prob, next_state, reward) tuples
        gamma: discount factor
        theta: threshold for convergence

    Returns:
        V: dict mapping state -> value
        policy: dict mapping state -> best action (one of actions[state])
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            best_value = float('-inf')
            # If no actions available, value remains 0 (or could be terminal)
            if not actions.get(s):
                best_value = 0.0
            else:
                for a in actions[s]:
                    q = 0.0
                    for prob, s_next, r in P[s][a]:
                        q += prob * (r + gamma * V[s_next])
                    if q > best_value:
                        best_value = q
            V[s] = best_value
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break

    # Extract greedy policy
    policy = {}
    for s in states:
        if not actions.get(s):
            policy[s] = None
            continue
        best_a = None
        best_q = float('-inf')
        for a in actions[s]:
            q = 0.0
            for prob, s_next, r in P[s][a]:
                q += prob * (r + gamma * V[s_next])
            if q > best_q:
                best_q = q
                best_a = a
        policy[s] = best_a
    return V, policy

states = ["s0", "s1"]
actions = {
    "s0": ["stay", "go"],
    "s1": ["stay"]
}
# P[state][action] = list of (probability, next_state, reward)
P = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "go": [(1.0, "s1", 1.0)]
    },
    "s1": {
        "stay": [(1.0, "s1", 0.0)]
    }
}
V, policy = value_iteration(states, actions, P, gamma=0.9, theta=1e-6)
print("Value function:", V)
print("Policy:", policy)

Q-value iteration is similar to value iteration, but calculates Q-values instead of state-values. It is approximately equivalent to value iteration in computation.

Initially $Q_0(s,a) = 0, \; \forall s \in S,\; a \in A$.

$$Q_{k+1}(s,a) = \sum_{s'} T(s,a,s') \Big[ R(s,a,s') + \gamma \max_{a'} Q_k(s',a') \Big]$$

Here:

  • $Q_k(s,a)$: estimate at iteration $k$
  • $T(s,a,s')$: transition probability
  • $R(s,a,s')$: immediate reward
  • $\gamma$: discount factor
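
A minimal sketch of this update in Python, assuming the same `states`/`actions`/`P` dictionary format used in the value iteration code above:

from typing import Dict, List, Tuple, Any

def q_value_iteration(states: List[Any],
                      actions: Dict[Any, List[Any]],
                      P: Dict[Any, Dict[Any, List[Tuple[float, Any, float]]]],
                      gamma: float = 0.99,
                      theta: float = 1e-6) -> Dict[Tuple[Any, Any], float]:
    """Apply the Q-value update until the largest change is below theta."""
    Q = {(s, a): 0.0 for s in states for a in actions.get(s, [])}
    while True:
        delta = 0.0
        Q_new = {}
        for s in states:
            for a in actions.get(s, []):
                q = 0.0
                for prob, s_next, r in P[s][a]:
                    # max over a' of Q_k(s', a'); 0 if s' has no actions (terminal)
                    next_best = max((Q[(s_next, a2)] for a2 in actions.get(s_next, [])),
                                    default=0.0)
                    q += prob * (r + gamma * next_best)
                Q_new[(s, a)] = q
                delta = max(delta, abs(q - Q[(s, a)]))
        Q = Q_new
        if delta < theta:
            break
    return Q

The optimal policy is then read off as $\pi^*(s) = \arg\max_a Q(s,a)$, just as in value iteration.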

Policy iteration alternates between two steps until convergence:

  • Policy evaluation: given a (fixed) policy $\pi$, compute its state-value function $V^\pi$ (exactly or approximately).
  • Policy improvement: produce a new policy $\pi'$ by acting greedily with respect to $V^\pi$.

For a deterministic policy $\pi$, the Bellman equation for $V^\pi$ is:

$$V^\pi(s) = \sum_{s'} T\big(s, \pi(s), s'\big) \Big[ R\big(s, \pi(s), s'\big) + \gamma V^\pi(s') \Big]$$

For finite MDPs, policy iteration converges to an optimal policy $\pi^*$ in a finite number of policy-improvement steps, since there are only finitely many deterministic policies and each improvement step never makes the policy worse.

from typing import Dict, List, Tuple, Any, Optional

def policy_evaluation(states: List[Any],
                      actions: Dict[Any, List[Any]],
                      P: Dict[Any, Dict[Any, List[Tuple[float, Any, float]]]],
                      policy: Dict[Any, Optional[Any]],
                      gamma: float = 0.99,
                      theta: float = 1e-6,
                      max_iterations: int = 1000) -> Dict[Any, float]:
    """
    Evaluate a given policy by iterative policy evaluation (synchronous updates).

    Args:
        states: list of states
        actions: dict mapping state -> list of possible actions
        P: transition model P[s][a] = list of (prob, next_state, reward)
        policy: dict mapping state -> action (or None for terminal / no-op)
        gamma: discount factor
        theta: threshold for stopping
        max_iterations: max number of sweeps

    Returns:
        V: dict mapping state -> value under the policy
    """
    V = {s: 0.0 for s in states}
    for i in range(max_iterations):
        delta = 0.0
        V_new = V.copy()
        for s in states:
            a = policy.get(s)
            # If policy prescribes no action (terminal), value is 0 (or unchanged)
            if a is None or not actions.get(s):
                v = 0.0
            else:
                v = 0.0
                for prob, s_next, r in P[s][a]:
                    v += prob * (r + gamma * V[s_next])
            V_new[s] = v
            delta = max(delta, abs(V[s] - V_new[s]))
        V = V_new
        if delta < theta:
            break
    return V

def policy_improvement(states: List[Any],
                       actions: Dict[Any, List[Any]],
                       P: Dict[Any, Dict[Any, List[Tuple[float, Any, float]]]],
                       V: Dict[Any, float],
                       gamma: float = 0.99) -> Dict[Any, Optional[Any]]:
    """
    Given a value function V, return a greedy policy.

    Args:
        states, actions, P: MDP specification (same as above)
        V: dict mapping state -> value estimate
        gamma: discount factor

    Returns:
        policy: dict mapping state -> best action (or None)
    """
    policy = {}
    for s in states:
        if not actions.get(s):
            policy[s] = None
            continue
        best_a = None
        best_q = float('-inf')
        for a in actions[s]:
            q = 0.0
            for prob, s_next, r in P[s][a]:
                q += prob * (r + gamma * V[s_next])
            if q > best_q:
                best_q = q
                best_a = a
        policy[s] = best_a
    return policy

def policy_iteration(states: List[Any],
                     actions: Dict[Any, List[Any]],
                     P: Dict[Any, Dict[Any, List[Tuple[float, Any, float]]]],
                     gamma: float = 0.99,
                     theta: float = 1e-6,
                     max_eval_iterations: int = 1000,
                     max_policy_iters: int = 1000) -> Tuple[Dict[Any, float], Dict[Any, Optional[Any]]]:
    """
    Perform policy iteration.

    Args:
        states: list of states
        actions: dict mapping state -> list of possible actions
        P: transition model
        gamma: discount factor
        theta: threshold for policy evaluation stopping
        max_eval_iterations: max sweeps for policy evaluation
        max_policy_iters: max number of policy improvement iterations

    Returns:
        V: dict mapping state -> value (for final policy)
        policy: dict mapping state -> action
    """
    # Initialize an arbitrary policy: pick first available action
    policy = {}
    for s in states:
        if not actions.get(s):
            policy[s] = None
        else:
            policy[s] = actions[s][0]

    for i in range(max_policy_iters):
        # Policy evaluation (compute V^pi)
        V = policy_evaluation(states, actions, P, policy, gamma, theta, max_eval_iterations)
        # Policy improvement (greedy w.r.t. V)
        new_policy = policy_improvement(states, actions, P, V, gamma)
        # Check for convergence (policy stable)
        if new_policy == policy:
            break
        policy = new_policy

    # Final evaluation to return V consistent with final policy
    V = policy_evaluation(states, actions, P, policy, gamma, theta, max_eval_iterations)
    return V, policy

# Example usage: a tiny MDP (same setup as in the value iteration example)
states = ["s0", "s1"]
actions = {
    "s0": ["stay", "go"],
    "s1": ["stay"]
}
P = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "go": [(1.0, "s1", 1.0)]
    },
    "s1": {
        "stay": [(1.0, "s1", 0.0)]
    }
}
V, policy = policy_iteration(states, actions, P, gamma=0.9, theta=1e-6)
print("Value function:", V)
print("Policy:", policy)