Markov Decision Process (MDP)
Date: 06 Oct 2025
Introduction
A Markov Decision Process is a mathematical framework used for decision-making problems where outcomes are partially random and partially under the agent’s control.
Key Terms
1. State (S)
A state represents the current situation or condition of the agent.
2. Action (A)
Each state can have one or more possible actions that the agent can take.
3. Transition Function (T)
A function mapping:
$T: S \times A \times S \rightarrow [0, 1]$
where $S \times A$ is the Cartesian product of the state and action sets. It defines the probability of moving to the next state $s'$ after taking action $a$ in state $s$:
$T(s, a, s') = P(S_{t+1} = s' \mid S_t = s, A_t = a)$
In the deterministic special case, this reduces to a mapping $T(s, a) = s'$.
4. Reward (R)
A numerical value received after taking an action.
Positive reward → good outcome
Negative reward → bad outcome
5. Policy
A policy defines the agent’s behaviour:
$\pi: S \rightarrow A$
$\pi(s)$ = action chosen in state $s$.
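The ingredients above can be sketched concretely. The following is a minimal toy MDP in Python; the two states, two actions, and all probabilities and rewards are invented purely for illustration.

```python
states = ["s0", "s1"]
actions = ["stay", "move"]

# T[(s, a)] maps to {next_state: probability}, i.e. T : S x A x S -> [0, 1]
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# R[(s, a)]: numerical reward received after taking action a in state s
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): -1.0,
}

# A deterministic policy pi : S -> A
pi = {"s0": "move", "s1": "stay"}

# Sanity check: every transition distribution must sum to 1
assert all(abs(sum(p.values()) - 1.0) < 1e-9 for p in T.values())
print(pi["s0"])  # the action chosen in state s0
```

Representing T and R as dictionaries keyed by (state, action) keeps the example close to the mathematical definitions; a real implementation would usually use arrays or a simulator instead.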
Return
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
where $\gamma \in [0, 1]$ is the discount factor.
If $\gamma$ is close to 0 → the agent focuses on immediate rewards
If $\gamma$ is close to 1 → the agent focuses on long-term rewards
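The effect of the discount factor can be seen by computing the return for a finite reward sequence (the reward values here are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """G = sum of gamma^k * rewards[k] for k = 0, 1, 2, ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]

print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, 0.9))  # 1 + 0.9 + 0.81 + 0.729 = 3.439
```

With $\gamma = 0$ the later rewards vanish entirely; with $\gamma$ near 1 they contribute almost fully.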
Value Function
Long-term value of a state or action.
State Value (for MRP — a Markov Reward Process, i.e. an MDP without actions)
$V(s) = \mathbb{E}[G_t \mid S_t = s]$
Expected return starting from state s.
Bellman Expectation Equation (MRP)
$V(s) = \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s]$
It breaks value into:
Immediate reward
Discounted value of next state
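The Bellman expectation equation can be applied repeatedly until the values stop changing. Below is a sketch on a toy 2-state MRP; the transition probabilities and rewards are illustrative assumptions.

```python
P = {  # P[s][s'] = probability of moving from state s to s'
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 0.2, "b": 0.8},
}
R = {"a": 1.0, "b": 0.0}  # expected immediate reward when leaving each state
gamma = 0.9

# Start from V(s) = 0 and sweep the Bellman expectation equation:
# V(s) = R(s) + gamma * sum_s' P(s' | s) V(s')
V = {s: 0.0 for s in P}
for _ in range(1000):  # repeated sweeps converge for gamma < 1
    V = {s: R[s] + gamma * sum(p * V[s2] for s2, p in P[s].items()) for s in P}

print({s: round(v, 3) for s, v in V.items()})
```

Each sweep replaces the value of a state with its immediate reward plus the discounted value of the next state, exactly the two terms listed above.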
State Value Function (MDP)
$V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
Expected return under policy π.
Action Value Function (MDP)
$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
Optimal Value Functions
Used for finding the best (optimal) policy.
Optimal State Value Function
$V^*(s) = \max_{\pi} \; V_\pi(s)$
Optimal Action Value Function
$q^*(s, a) = \max_{\pi} \; q_\pi(s, a)$
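A standard way to compute these optimal values is value iteration, which alternates between evaluating $q$ and taking the max over actions. The toy MDP below (all numbers are illustrative assumptions) makes the resulting greedy policy visible:

```python
T = {  # deterministic transitions for simplicity: {next_state: probability}
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 1.0},
}
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "move"): 0.0}
states, actions, gamma = ["s0", "s1"], ["stay", "move"], 0.9

# Value iteration: q(s,a) = R(s,a) + gamma * sum_s' T(s'|s,a) V(s'),
# then V(s) = max_a q(s,a)
V = {s: 0.0 for s in states}
for _ in range(500):
    q = {(s, a): R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
         for s in states for a in actions}
    V = {s: max(q[(s, a)] for a in actions) for s in states}

# The policy that is greedy with respect to q* is optimal
policy = {s: max(actions, key=lambda a, s=s: q[(s, a)]) for s in states}
print(policy)
```

Here staying in s1 pays 2 forever, so $V^*(\text{s1}) = 2 / (1 - \gamma) = 20$, and the greedy policy moves from s0 to s1.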
Applications of MDP
Robotics: Robots use MDPs to move safely and efficiently.
Gameplay: Helps game characters choose the best strategy to win or complete tasks.
Healthcare: Doctors can plan the best treatment while considering uncertain outcomes.
Traffic/Navigation: Self-driving cars and delivery vehicles choose safe and efficient routes.
Inventory: Stores decide when to restock so they don’t run out or overstock, even when demand changes.
Markov Property (Memoryless Property)
MDP follows the Markov property, which states that the next state depends only on the current state and action, not on the past states.
Consider that the agent has visited states $S_0, S_1, \dots, S_t$ after taking actions $A_0, A_1, \dots, A_{t-1}$, and has just taken action $A_t$. The probability that the agent then arrives at state $S_{t+1}$, given the history of previous states and actions, can be written as:
$P(S_{t+1} \mid S_0, S_1, \dots, S_t, A_0, A_1, \dots, A_t) = P(S_{t+1} \mid S_t, A_t)$
Policy Gradient Method
Policy gradient methods are used to find the optimal policy. More broadly, approaches to this problem fall into two categories:
1. Value-Based
Works by learning a value function, which can be used later to create a policy.
$V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
$\text{where } G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
2. Policy-Based
Works by approximating the policy directly:
$a = \pi(s)$
Stochastically:
$\pi_\theta(a \mid s) = P[A_t = a \mid S_t = s, \theta]$
A policy is just a probability distribution over possible actions given a state.
An objective function can represent the expected accumulated reward.
Policy gradient methods directly optimize the policy, unlike value-based methods that estimate state values.
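A common concrete form of $\pi_\theta$ is a softmax over learned preferences. The sketch below, for a single state with two actions (all numbers invented for illustration), applies the REINFORCE-style update $\theta \mathrel{+}= \alpha \, G \, \nabla_\theta \log \pi_\theta(a)$, where for a softmax $\partial \log \pi(a_{\text{taken}}) / \partial \theta_a = \mathbf{1}\{a = a_{\text{taken}}\} - \pi(a)$:

```python
import math

theta = {"left": 0.0, "right": 0.0}  # one state, two actions

def policy(theta):
    """Softmax over action preferences: a probability distribution."""
    z = sum(math.exp(v) for v in theta.values())
    return {a: math.exp(v) / z for a, v in theta.items()}

def update(theta, a_taken, G, alpha=0.1):
    """One policy-gradient step: push probability toward rewarded actions."""
    probs = policy(theta)
    return {a: theta[a] + alpha * G * ((a == a_taken) - probs[a]) for a in theta}

# Suppose "right" repeatedly earns return G = 1: its probability grows.
for _ in range(100):
    theta = update(theta, "right", G=1.0)

probs = policy(theta)
print(probs["right"] > probs["left"])  # True
```

This is the directness the text describes: the update changes the policy's parameters straight away, with no value function involved.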
Actor-Critic Algorithm
Actor-critic is a policy gradient method with two networks: actor and critic.
Actor: decides which action to take
Critic: informs the actor how good the action is and how to adjust it
The actor learns using the policy gradient approach, while the critic evaluates actions by computing the value function.
This way, the actor-critic algorithm is a mixture of value-based and policy gradient methods.
Steps:
Sample a trajectory using policy $\pi_\theta$ from the actor network.
Evaluate the advantage function (also called the TD error):
$A_{\pi_\theta}(S_t, A_t) = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
Evaluate the gradient:
$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(A_t \mid S_t) \cdot A_{\pi_\theta}(S_t, A_t)$
Update the policy parameter $\theta$:
$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
Update the weights of the critic based on value-based RL (using the advantage function).
Repeat steps 1–5 until the optimal policy is found.
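The steps above can be traced through one update on a toy one-state, two-action problem. Everything here (the tabular critic, the softmax actor, the sampled action, and all learning rates) is an illustrative assumption, not a full algorithm:

```python
import math

theta = {"a0": 0.0, "a1": 0.0}   # actor parameters (softmax preferences)
V = {"s": 0.0}                   # critic: tabular state-value estimate
alpha_actor, alpha_critic, gamma = 0.1, 0.5, 0.9

def pi(theta):
    z = sum(math.exp(v) for v in theta.values())
    return {a: math.exp(v) / z for a, v in theta.items()}

# Step 1: sample an action with the actor (fixed here for determinism)
action, reward, next_state = "a1", 1.0, "s"

# Step 2: advantage as the TD error: A = r + gamma * V(s') - V(s)
A = reward + gamma * V[next_state] - V["s"]

# Steps 3-4: policy-gradient step on the actor
probs = pi(theta)
theta = {a: theta[a] + alpha_actor * A * ((a == action) - probs[a]) for a in theta}

# Step 5: move the critic toward the TD target (value-based update)
V["s"] += alpha_critic * A

print(round(theta["a1"], 3), round(V["s"], 3))  # 0.05 0.5
```

The actor's parameters move toward the rewarded action while the critic's value estimate moves toward the observed TD target, which is exactly the value-based/policy-gradient mixture described above.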
Applications of Policy Gradient Methods
Robotics: Learn tasks like picking objects, walking, and avoiding obstacles.
Autonomous Vehicles: Optimize driving efficiency.
Games: AI learns strategies in games such as chess and video games through trial and error.
ML Tasks: Optimize policies for tasks like machine translation or dialog generation.
Time Series Analysis (Brief Overview)
A time series is a sequence of data points collected at successive time intervals.
Examples: stock prices, temperature readings, sales figures.
Components:
Trend: Long-term movement of data (increasing, decreasing, or stable).
Seasonality: Patterns occurring at regular intervals (e.g., daily, monthly, yearly).
Cyclicality: Long-term patterns without fixed periodicity (e.g., business cycles).
Irregularity/Noise: Random, unpredictable fluctuations.
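The components above can be made concrete by constructing a synthetic monthly series from them; the coefficients below are arbitrary assumptions chosen only to make each component visible.

```python
import math
import random

random.seed(0)
n = 48  # four years of monthly data
series = []
for t in range(n):
    trend = 0.5 * t                                      # steady long-term increase
    seasonality = 3.0 * math.sin(2 * math.pi * t / 12)   # repeating yearly pattern
    noise = random.gauss(0, 1)                           # irregular fluctuations
    series.append(trend + seasonality + noise)

print(len(series), round(series[-1], 2))
```

Decomposition methods work in the opposite direction: given only `series`, they try to recover the trend, seasonal, and irregular parts.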
Time Series Forecasting
Statistical techniques are used to predict future values based on past observations.
ARIMA (AutoRegressive Integrated Moving Average) is a common method.
ARIMA first differences the series $d$ times, then fits an ARMA(p, q) model to the result:
$\Delta^d x_t = [\phi_1 \Delta^d x_{t-1} + \dots + \phi_p \Delta^d x_{t-p}] + [\theta_1 E_{t-1} + \dots + \theta_q E_{t-q}] + E_t$
Components
AR (AutoRegressive):
$x_t = \phi_1 x_{t-1} + \dots + \phi_p x_{t-p} + E_t$
$\phi_i$ → AR coefficients, $p$ → order, $E_t$ → white noise
I (Integrated):
$\Delta x_t = x_t - x_{t-1}$
The difference operator $\Delta$ is applied $d$ times to make the series stationary.
MA (Moving Average):
$x_t = \theta_1 E_{t-1} + \dots + \theta_q E_{t-q} + E_t$
Uses past forecast errors ($\theta_i$ → MA coefficients, $q$ → order) to forecast the series.
Model Parameters
p: number of lag observations (AR order)
d: number of times raw observations are differenced (I)
q: order of moving average (MA)
Values of p, d, q are hyperparameters, tuned via trial and error for best forecasting accuracy.
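To make p and d concrete, here is a hand-rolled sketch on a toy series: difference once (d = 1), fit a single AR coefficient by least squares (p = 1, q = 0), and forecast one step ahead. This is not a full ARIMA implementation (libraries such as statsmodels provide that); the series and the (p, d, q) choice are illustrative assumptions.

```python
series = [10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 19.0, 21.0]

# d = 1: one round of differencing removes the linear trend
diff = [series[t] - series[t - 1] for t in range(1, len(series))]

# p = 1: least-squares AR(1) fit, phi = sum(x_t * x_{t-1}) / sum(x_{t-1}^2)
num = sum(diff[t] * diff[t - 1] for t in range(1, len(diff)))
den = sum(diff[t - 1] ** 2 for t in range(1, len(diff)))
phi = num / den

# Forecast the next difference, then undo the differencing (the "I" step)
next_diff = phi * diff[-1]
forecast = series[-1] + next_diff
print(round(phi, 3), round(forecast, 2))  # 0.8 22.6
```

Choosing larger p, d, or q adds more lag terms, more differencing rounds, or past-error terms respectively, which is what the trial-and-error tuning above is selecting among.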
