Markov Decision Process (MDP)
Date: 06 Oct 2025
Introduction
A Markov Decision Process is a mathematical framework used for decision-making problems where outcomes are partially random and partially under the agent’s control.
Key Terms
1. State (S)
A state represents the current situation or condition of the agent.
2. Action (A)
Each state can have one or more possible actions that the agent can take.
3. Transition Function (T)
A function mapping:
$T: S \times A \times S \rightarrow [0, 1]$
where $S \times A$ is the Cartesian product of the state and action sets. It defines the probability of moving to the next state $s'$ after taking action $a$ in state $s$:
$T(s, a, s') = P(S_{t+1} = s' \mid S_t = s, A_t = a)$
In the deterministic special case, this reduces to a mapping $T(s, a) = s'$.
4. Reward (R)
A numerical value received after taking an action.
Positive reward → good outcome
Negative reward → bad outcome
5. Policy
A policy defines the agent’s behaviour:
$\pi: S \rightarrow A$
$\pi(s)$ = action chosen in state $s$.
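The ingredients above can be sketched concretely. The following is a minimal toy MDP in Python; the two states, two actions, and all probabilities and rewards are invented purely for illustration.

```python
states = ["s0", "s1"]
actions = ["stay", "move"]

# T[(s, a)] maps to {next_state: probability}, i.e. T : S x A x S -> [0, 1]
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# R[(s, a)]: numerical reward received after taking action a in state s
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): -1.0,
}

# A deterministic policy pi : S -> A
pi = {"s0": "move", "s1": "stay"}

# Sanity check: every transition distribution must sum to 1
assert all(abs(sum(p.values()) - 1.0) < 1e-9 for p in T.values())
print(pi["s0"])  # the action chosen in state s0
```

Representing T and R as dictionaries keyed by (state, action) keeps the example close to the mathematical definitions; a real implementation would usually use arrays or a simulator instead.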
Return
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
where $\gamma \in [0, 1]$ is the discount factor.
If $\gamma$ is close to 0 → the agent focuses on immediate rewards
If $\gamma$ is close to 1 → the agent focuses on long-term rewards
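The effect of the discount factor can be seen by computing the return for a finite reward sequence (the reward values here are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """G = sum of gamma^k * rewards[k] for k = 0, 1, 2, ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]

print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, 0.9))  # 1 + 0.9 + 0.81 + 0.729 = 3.439
```

With $\gamma = 0$ the later rewards vanish entirely; with $\gamma$ near 1 they contribute almost fully.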
Value Function
Long-term value of a state or action.
State Value (for MRP — a Markov Reward Process, i.e. an MDP without actions)
$V(s) = \mathbb{E}[G_t \mid S_t = s]$
Expected return starting from state s.
Bellman Expectation Equation (MRP)
$V(s) = \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s]$
It breaks value into:
Immediate reward
Discounted value of next state
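The Bellman expectation equation can be applied repeatedly until the values stop changing. Below is a sketch on a toy 2-state MRP; the transition probabilities and rewards are illustrative assumptions.

```python
P = {  # P[s][s'] = probability of moving from state s to s'
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 0.2, "b": 0.8},
}
R = {"a": 1.0, "b": 0.0}  # expected immediate reward when leaving each state
gamma = 0.9

# Start from V(s) = 0 and sweep the Bellman expectation equation:
# V(s) = R(s) + gamma * sum_s' P(s' | s) V(s')
V = {s: 0.0 for s in P}
for _ in range(1000):  # repeated sweeps converge for gamma < 1
    V = {s: R[s] + gamma * sum(p * V[s2] for s2, p in P[s].items()) for s in P}

print({s: round(v, 3) for s, v in V.items()})
```

Each sweep replaces the value of a state with its immediate reward plus the discounted value of the next state, exactly the two terms listed above.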
State Value Function (MDP)
$V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
Expected return under policy π.
Action Value Function (MDP)
$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
Optimal Value Functions
Used for finding the best (optimal) policy.
Optimal State Value Function
$V^*(s) = \max_{\pi} \; V_\pi(s)$
Optimal Action Value Function
$q^*(s, a) = \max_{\pi} \; q_\pi(s, a)$
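A standard way to compute these optimal values is value iteration, which alternates between evaluating $q$ and taking the max over actions. The toy MDP below (all numbers are illustrative assumptions) makes the resulting greedy policy visible:

```python
T = {  # deterministic transitions for simplicity: {next_state: probability}
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 1.0},
}
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "move"): 0.0}
states, actions, gamma = ["s0", "s1"], ["stay", "move"], 0.9

# Value iteration: q(s,a) = R(s,a) + gamma * sum_s' T(s'|s,a) V(s'),
# then V(s) = max_a q(s,a)
V = {s: 0.0 for s in states}
for _ in range(500):
    q = {(s, a): R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
         for s in states for a in actions}
    V = {s: max(q[(s, a)] for a in actions) for s in states}

# The policy that is greedy with respect to q* is optimal
policy = {s: max(actions, key=lambda a, s=s: q[(s, a)]) for s in states}
print(policy)
```

Here staying in s1 pays 2 forever, so $V^*(\text{s1}) = 2 / (1 - \gamma) = 20$, and the greedy policy moves from s0 to s1.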
Applications of MDP
Robotics: Robots use MDPs to move safely and efficiently.
Gameplay: Helps game characters choose the best strategy to win or complete tasks.
Healthcare: Doctors can plan the best treatment while considering uncertain outcomes.
Traffic/Navigation: Self-driving cars and delivery vehicles choose safe and efficient routes.
Inventory: Stores decide when to restock so they don’t run out or overstock, even when demand changes.
Markov Property (Memoryless Property)
MDP follows the Markov property, which states that the next state depends only on the current state and action, not on the past states.
Consider that the agent has visited states $S_0, S_1, \dots, S_t$ after taking actions $A_0, A_1, \dots, A_{t-1}$, and has just taken action $A_t$. The probability that the agent then arrives at state $S_{t+1}$, given the history of previous states and actions, can be written as:
$P(S_{t+1} \mid S_0, S_1, \dots, S_t, A_0, A_1, \dots, A_t) = P(S_{t+1} \mid S_t, A_t)$
Policy Gradient Method
Policy gradient methods are used to find the optimal policy. More broadly, approaches to this problem fall into two categories:
1. Value-Based
Works by learning a value function, which can be used later to create a policy.
$V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
$\text{where } G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
2. Policy-Based
Works by approximating the policy directly:
$a = \pi(s)$
Stochastically:
$\pi_\theta(a \mid s) = P[A_t = a \mid S_t = s, \theta]$
A policy is just a probability distribution over possible actions given a state.
An objective function can represent the expected accumulated reward.
Policy gradient methods directly optimize the policy, unlike value-based methods that estimate state values.
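A common concrete form of $\pi_\theta$ is a softmax over learned preferences. The sketch below, for a single state with two actions (all numbers invented for illustration), applies the REINFORCE-style update $\theta \mathrel{+}= \alpha \, G \, \nabla_\theta \log \pi_\theta(a)$, where for a softmax $\partial \log \pi(a_{\text{taken}}) / \partial \theta_a = \mathbf{1}\{a = a_{\text{taken}}\} - \pi(a)$:

```python
import math

theta = {"left": 0.0, "right": 0.0}  # one state, two actions

def policy(theta):
    """Softmax over action preferences: a probability distribution."""
    z = sum(math.exp(v) for v in theta.values())
    return {a: math.exp(v) / z for a, v in theta.items()}

def update(theta, a_taken, G, alpha=0.1):
    """One policy-gradient step: push probability toward rewarded actions."""
    probs = policy(theta)
    return {a: theta[a] + alpha * G * ((a == a_taken) - probs[a]) for a in theta}

# Suppose "right" repeatedly earns return G = 1: its probability grows.
for _ in range(100):
    theta = update(theta, "right", G=1.0)

probs = policy(theta)
print(probs["right"] > probs["left"])  # True
```

This is the directness the text describes: the update changes the policy's parameters straight away, with no value function involved.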
Actor-Critic Algorithm
Actor-critic is a policy gradient method with two networks: actor and critic.
Actor: decides which action to take
Critic: informs the actor how good the action is and how to adjust it
The actor learns using the policy gradient approach, while the critic evaluates actions by computing the value function.
This way, the actor-critic algorithm is a mixture of value-based and policy gradient methods.
Steps:
Sample a trajectory using policy $\pi_\theta$ from the actor network.
Evaluate the advantage function (also called the TD error):
$A_{\pi_\theta}(S_t, A_t) = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
Evaluate the gradient:
$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(A_t \mid S_t) \cdot A_{\pi_\theta}(S_t, A_t)$
Update the policy parameter $\theta$:
$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
Update the weights of the critic based on value-based RL (using the advantage function).
Repeat steps 1–5 until the optimal policy is found.
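The steps above can be traced through one update on a toy one-state, two-action problem. Everything here (the tabular critic, the softmax actor, the sampled action, and all learning rates) is an illustrative assumption, not a full algorithm:

```python
import math

theta = {"a0": 0.0, "a1": 0.0}   # actor parameters (softmax preferences)
V = {"s": 0.0}                   # critic: tabular state-value estimate
alpha_actor, alpha_critic, gamma = 0.1, 0.5, 0.9

def pi(theta):
    z = sum(math.exp(v) for v in theta.values())
    return {a: math.exp(v) / z for a, v in theta.items()}

# Step 1: sample an action with the actor (fixed here for determinism)
action, reward, next_state = "a1", 1.0, "s"

# Step 2: advantage as the TD error: A = r + gamma * V(s') - V(s)
A = reward + gamma * V[next_state] - V["s"]

# Steps 3-4: policy-gradient step on the actor
probs = pi(theta)
theta = {a: theta[a] + alpha_actor * A * ((a == action) - probs[a]) for a in theta}

# Step 5: move the critic toward the TD target (value-based update)
V["s"] += alpha_critic * A

print(round(theta["a1"], 3), round(V["s"], 3))  # 0.05 0.5
```

The actor's parameters move toward the rewarded action while the critic's value estimate moves toward the observed TD target, which is exactly the value-based/policy-gradient mixture described above.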
Applications of Policy Gradient Methods
Robotics: Learn tasks like picking objects, walking, and avoiding obstacles.
Autonomous Vehicles: Optimize driving efficiency.
Games: AI learns strategies in games such as chess and video games through trial and error.
ML Tasks: Optimize policies for tasks like machine translation or dialog generation.
Time Series Analysis (Brief Overview)
A time series is a sequence of data points collected at successive time intervals.
Examples: stock prices, temperature readings, sales figures.
Components:
Trend: Long-term movement of data (increasing, decreasing, or stable).
Seasonality: Patterns occurring at regular intervals (e.g., daily, monthly, yearly).
Cyclicality: Long-term patterns without fixed periodicity (e.g., business cycles).
Irregularity/Noise: Random, unpredictable fluctuations.
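The components above can be made concrete by constructing a synthetic monthly series from them; the coefficients below are arbitrary assumptions chosen only to make each component visible.

```python
import math
import random

random.seed(0)
n = 48  # four years of monthly data
series = []
for t in range(n):
    trend = 0.5 * t                                      # steady long-term increase
    seasonality = 3.0 * math.sin(2 * math.pi * t / 12)   # repeating yearly pattern
    noise = random.gauss(0, 1)                           # irregular fluctuations
    series.append(trend + seasonality + noise)

print(len(series), round(series[-1], 2))
```

Decomposition methods work in the opposite direction: given only `series`, they try to recover the trend, seasonal, and irregular parts.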
Time Series Forecasting
Statistical techniques are used to predict future values based on past observations.
ARIMA (AutoRegressive Integrated Moving Average) is a common method.
ARIMA first differences the series $d$ times, then fits an ARMA(p, q) model to the result:
$\Delta^d x_t = [\phi_1 \Delta^d x_{t-1} + \dots + \phi_p \Delta^d x_{t-p}] + [\theta_1 E_{t-1} + \dots + \theta_q E_{t-q}] + E_t$
Components
AR (AutoRegressive):
$x_t = \phi_1 x_{t-1} + \dots + \phi_p x_{t-p} + E_t$
$\phi_i$ → AR coefficients, $p$ → order, $E_t$ → white noise
I (Integrated):
$\Delta x_t = x_t - x_{t-1}$
The difference operator $\Delta$ is applied $d$ times to make the series stationary.
MA (Moving Average):
$x_t = \theta_1 E_{t-1} + \dots + \theta_q E_{t-q} + E_t$
Uses past forecast errors ($\theta_i$ → MA coefficients, $q$ → order) to forecast the series.
Model Parameters
p: number of lag observations (AR order)
d: number of times raw observations are differenced (I)
q: order of moving average (MA)
Values of p, d, q are hyperparameters, tuned via trial and error for best forecasting accuracy.
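To make p and d concrete, here is a hand-rolled sketch on a toy series: difference once (d = 1), fit a single AR coefficient by least squares (p = 1, q = 0), and forecast one step ahead. This is not a full ARIMA implementation (libraries such as statsmodels provide that); the series and the (p, d, q) choice are illustrative assumptions.

```python
series = [10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 19.0, 21.0]

# d = 1: one round of differencing removes the linear trend
diff = [series[t] - series[t - 1] for t in range(1, len(series))]

# p = 1: least-squares AR(1) fit, phi = sum(x_t * x_{t-1}) / sum(x_{t-1}^2)
num = sum(diff[t] * diff[t - 1] for t in range(1, len(diff)))
den = sum(diff[t - 1] ** 2 for t in range(1, len(diff)))
phi = num / den

# Forecast the next difference, then undo the differencing (the "I" step)
next_diff = phi * diff[-1]
forecast = series[-1] + next_diff
print(round(phi, 3), round(forecast, 2))  # 0.8 22.6
```

Choosing larger p, d, or q adds more lag terms, more differencing rounds, or past-error terms respectively, which is what the trial-and-error tuning above is selecting among.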
