Markov Decision Process (MDP)

Nov 19, 2025


Date: 06 Oct 2025

Introduction

A Markov Decision Process (MDP) is a mathematical framework used in reinforcement learning, machine learning, and artificial intelligence for solving decision-making problems where outcomes are partially random and partially controlled by an agent.

MDPs are widely used in robotics, autonomous systems, gameplay AI, healthcare planning, recommendation systems, and self-driving vehicles.


Key Terms in MDP

1. State (S)

A state represents the current situation of the agent within its environment.

Example:

  • In a chess game, the arrangement of pieces on the board is a state.

  • In robotics, the robot’s current position is a state.


2. Action (A)

An action is a possible move the agent can take.

Each state can have one or more actions available.

Example:

  • Move left

  • Move right

  • Pick object

  • Stop


3. Transition Function (T)

The transition function defines how the environment changes after an action.

It represents the probability of moving from one state to another after taking an action.

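It is defined on the Cartesian product of states, actions, and next states:

$$T : S \times A \times S \to [0, 1], \qquad T(s, a, s') = P(S_{t+1} = s' \mid S_t = s, A_t = a)$$

where:

  • $s$ = current state

  • $a$ = action taken

  • $s'$ = next state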

4. Reward (R)

A reward is a numerical feedback signal received after taking an action.

  • Positive reward → desirable outcome

  • Negative reward → undesirable outcome

Example:

  • Winning a game → +100 reward

  • Crashing a robot → −50 reward


5. Policy (π)

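A policy defines the behavior of the agent.

The policy tells the agent what action to take in each state, either as a single action per state (deterministic policy) or as a probability distribution over actions (stochastic policy).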


Return

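The return is the total discounted future reward accumulated by the agent from time step $t$ onward:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

Where:

  • $\gamma \in [0, 1]$ = discount factor

Interpretation:

  • If $\gamma$ is close to 0 → the agent focuses on immediate rewards.

  • If $\gamma$ is close to 1 → the agent focuses on long-term rewards.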
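
To make the effect of the discount factor concrete, here is a minimal sketch that computes the discounted return of a finite reward sequence. The function name `discounted_return` and the sample rewards are assumptions made for the example.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma^k * rewards[k] by folding from the last reward back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]            # hypothetical rewards R_1, R_2, R_3
myopic = discounted_return(rewards, 0.0)      # only the first reward counts
balanced = discounted_return(rewards, 0.5)    # 1 + 0.5 + 0.25
farsighted = discounted_return(rewards, 1.0)  # plain sum of all rewards
```

With the same rewards, the return grows from 1.0 to 1.75 to 3.0 as the discount factor moves from 0 to 0.5 to 1, which matches the interpretation above.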


Value Function

A value function estimates how beneficial a state or action is in the long run.

State Value Function (MRP)

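$$V(s) = \mathbb{E}[G_t \mid S_t = s]$$

It represents the expected return starting from state $s$ in a Markov Reward Process (MRP), that is, an MDP without actions.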


Bellman Expectation Equation

$$V(s) = \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s]$$

The Bellman equation breaks the value into:

  • Immediate reward

  • Discounted future value

This recursive structure is the foundation of reinforcement learning algorithms.
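
One way to see the recursion at work is to apply it repeatedly until the values stop changing. The sketch below does this for a hypothetical two-state Markov Reward Process; the state names, rewards, probabilities, and the helper `evaluate_mrp` are all assumptions made for illustration, with `R[s]` standing for the expected immediate reward on leaving state `s`.

```python
def evaluate_mrp(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Iterate V(s) <- R(s) + gamma * sum_s' P(s, s') * V(s') to convergence."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iters):
        new_V = {
            s: R[s] + gamma * sum(p * V[s2] for s2, p in P[s].items())
            for s in P
        }
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    return V

# Hypothetical MRP: from A we stay in A or drift to the absorbing state B.
P = {"A": {"A": 0.5, "B": 0.5}, "B": {"B": 1.0}}
R = {"A": 1.0, "B": 0.0}
V = evaluate_mrp(P, R, gamma=0.9)
```

Solving the two equations by hand gives $V(B) = 0$ and $V(A) = 1 / (1 - 0.45) \approx 1.818$, which the iteration approaches.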


State Value Function in MDP

$$V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$$

This calculates the expected return when starting in state $s$ and following policy $\pi$.


Action Value Function

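$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$$

The action value function estimates the quality of taking action $a$ in state $s$ and following policy $\pi$ afterwards.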


Optimal Value Functions

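Optimal value functions give the highest value achievable by any policy. A policy that acts greedily with respect to them is optimal.

Optimal State Value Function

$$V^{*}(s) = \max_{\pi} V^{\pi}(s)$$

Optimal Action Value Function

$$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$$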
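
When the transition probabilities and rewards are known, the optimal values can be approximated with value iteration, a standard dynamic programming technique that is not described in this article but fits here. The sketch below assumes a hypothetical two-state MDP stored as `mdp[state][action] = [(probability, next_state, reward), ...]`; all names and numbers are invented for the example.

```python
def q_value(mdp, V, s, a, gamma):
    """Expected one-step reward plus discounted value of the next state."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])

def value_iteration(mdp, gamma, tol=1e-10, max_iters=10_000):
    """Repeat V(s) <- max_a Q(s, a) until the largest change is below tol."""
    V = {s: 0.0 for s in mdp}
    for _ in range(max_iters):
        new_V = {s: max(q_value(mdp, V, s, a, gamma) for a in mdp[s]) for s in mdp}
        delta = max(abs(new_V[s] - V[s]) for s in mdp)
        V = new_V
        if delta < tol:
            break
    # The greedy policy picks the action with the highest Q-value in each state.
    policy = {s: max(mdp[s], key=lambda a: q_value(mdp, V, s, a, gamma)) for s in mdp}
    return V, policy

# Hypothetical deterministic MDP.
mdp = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(1.0, "s1", 1.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)], "go": [(1.0, "s0", 0.0)]},
}
V_star, pi_star = value_iteration(mdp, gamma=0.5)
```

On this toy problem the values converge to $V^{*}(s_0) = 3$ and $V^{*}(s_1) = 4$, and the greedy policy is to `go` from `s0` and `stay` in `s1`.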

Markov Property (Memoryless Property)

An MDP satisfies the Markov property, meaning the future depends only on the current state and action, not on past history:

$$P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_1, A_1, \dots, S_t, A_t)$$

This property simplifies complex decision-making systems.


Policy Gradient Method

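Policy gradient methods are reinforcement learning techniques that directly optimize a parameterized policy by gradient ascent on the expected return.

They are best understood in contrast with value-based methods. Reinforcement learning methods are mainly divided into two categories: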

1. Value-Based Methods

Value-based methods first learn the value function and then derive a policy from it.

Examples:

  • Q-Learning

  • Deep Q-Networks (DQN)
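
As a rough illustration of the value-based idea, the sketch below runs tabular Q-learning on a hypothetical deterministic two-state environment and then derives a greedy policy from the learned table. The environment, the hyperparameter values, and the purely random exploration are assumptions chosen to keep the example short.

```python
import random

# Hypothetical toy environment: (state, action) -> (next_state, reward).
STEP = {
    ("s0", "stay"): ("s0", 0.0),
    ("s0", "go"): ("s1", 1.0),
    ("s1", "stay"): ("s1", 2.0),
    ("s1", "go"): ("s0", 0.0),
}
ACTIONS = ["stay", "go"]

def q_learning(n_steps=5000, alpha=0.5, gamma=0.5, seed=0):
    """Learn Q(s, a) from sampled transitions using the Q-learning update."""
    rng = random.Random(seed)
    Q = {sa: 0.0 for sa in STEP}
    s = "s0"
    for _ in range(n_steps):
        a = rng.choice(ACTIONS)  # random behavior policy (off-policy learning)
        s_next, r = STEP[(s, a)]
        target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q

Q = q_learning()
# Step two of a value-based method: derive the policy from the value function.
greedy_policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in ("s0", "s1")}
```

The policy is never represented directly here: it is read off the Q-table at the end, which is the defining trait of value-based methods.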


2. Policy-Based Methods

Policy-based methods directly learn the policy.

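Stochastic policy:

$$\pi_{\theta}(a \mid s) = P(A_t = a \mid S_t = s; \theta)$$

These methods optimize the expected reward objective function $J(\theta)$ by adjusting the policy parameters $\theta$.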

Advantages:

  • Better for continuous action spaces

  • More stable in some environments
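
A common way to parameterize a stochastic policy over discrete actions is a softmax over per-action preferences. The sketch below assumes that parameterization (it is one choice among many) and uses hypothetical action names; it also computes the gradient of the log-probability with respect to the preferences, which is the quantity policy gradient methods scale by the return or advantage.

```python
import math

def softmax_policy(prefs):
    """Turn action preferences theta(s, .) into probabilities pi_theta(. | s)."""
    m = max(prefs.values())  # subtract the max for numerical stability
    exps = {a: math.exp(v - m) for a, v in prefs.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def grad_log_pi(prefs, action):
    """d/d theta(s, b) of log pi(action | s) equals 1[b == action] - pi(b | s)."""
    probs = softmax_policy(prefs)
    return {b: (1.0 if b == action else 0.0) - probs[b] for b in prefs}

prefs = {"left": 0.0, "right": 0.0}   # hypothetical preferences for one state
probs = softmax_policy(prefs)          # equal preferences give equal probabilities
grad = grad_log_pi(prefs, "left")      # pushes "left" up and "right" down
```

Increasing a preference raises the probability of that action smoothly, which is what makes the objective differentiable in the parameters.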


Actor-Critic Algorithm

The Actor-Critic algorithm combines value-based and policy-based reinforcement learning approaches.

Components

Actor

  • Chooses actions

  • Learns the policy

Critic

  • Evaluates actions

  • Computes value functions and TD error


Steps of Actor-Critic Algorithm

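  1. Sample a transition $(s_t, a_t, r_{t+1}, s_{t+1})$ using policy $\pi_{\theta}$

  2. Compute advantage function $A(s_t, a_t) = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)$ (the TD error)

  3. Compute policy gradient $\nabla_{\theta} J(\theta) \approx \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \, A(s_t, a_t)$

  4. Update policy parameters $\theta \leftarrow \theta + \alpha_{\theta} \nabla_{\theta} J(\theta)$

  5. Update critic weights $w \leftarrow w + \alpha_{w} A(s_t, a_t) \nabla_{w} V_w(s_t)$

Here $\alpha_{\theta}$ and $\alpha_{w}$ are the actor and critic learning rates.

Repeat until the policy converges.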
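
The five steps can be sketched as a one-step actor-critic loop on a tiny tabular problem. Everything specific here is an assumption for illustration: the two-state environment, the softmax actor, the tabular critic, the use of the TD error as the advantage estimate, and the learning rates.

```python
import math
import random

# Hypothetical toy environment: (state, action) -> (next_state, reward).
STEP = {
    ("s0", "stay"): ("s0", 0.0),
    ("s0", "go"): ("s1", 1.0),
    ("s1", "stay"): ("s1", 2.0),
    ("s1", "go"): ("s0", 0.0),
}
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic(n_steps=20000, alpha_actor=0.05, alpha_critic=0.1,
                 gamma=0.5, seed=0):
    rng = random.Random(seed)
    theta = {(s, a): 0.0 for s in STATES for a in ACTIONS}  # actor parameters
    V = {s: 0.0 for s in STATES}                            # critic estimates
    s = "s0"
    for _ in range(n_steps):
        # 1. Sample an action from the current policy and observe the outcome.
        probs = softmax([theta[(s, a)] for a in ACTIONS])
        a = rng.choices(ACTIONS, weights=probs, k=1)[0]
        s_next, r = STEP[(s, a)]
        # 2. Advantage estimate: the one-step TD error.
        advantage = r + gamma * V[s_next] - V[s]
        # 3-4. Policy gradient for a softmax actor, then the parameter update.
        for b, p in zip(ACTIONS, probs):
            grad_log_pi = (1.0 if b == a else 0.0) - p
            theta[(s, b)] += alpha_actor * advantage * grad_log_pi
        # 5. Critic update toward the TD target.
        V[s] += alpha_critic * advantage
        s = s_next
    return theta, V

theta, V = actor_critic()
```

On this toy problem the actor's preferences should drift toward staying in `s1`, where the reward is highest, while the critic's estimates remain bounded by the best achievable discounted return.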


Applications of MDP

Robotics

Robots use MDPs for:

  • Navigation

  • Path planning

  • Obstacle avoidance

Gameplay AI

Used in:

  • Chess engines

  • Video game agents

  • Strategy optimization

Healthcare

Doctors and AI systems use MDPs for treatment planning under uncertain outcomes.

Traffic and Navigation

Applications include:

  • Self-driving cars

  • GPS route optimization

  • Delivery systems

Inventory Management

Businesses use MDPs to optimize stock levels and reduce losses.


Time Series Analysis

A time series is a sequence of observations collected over time.

Examples:

  • Stock prices

  • Temperature readings

  • Sales reports

Components of Time Series

1. Trend

Long-term increase or decrease in data.

2. Seasonality

Patterns repeating at regular intervals.

3. Cyclicality

Long-term fluctuations without fixed periodicity.

4. Noise

Random and unpredictable variations.


Time Series Forecasting

Time series forecasting predicts future values based on historical observations.

One of the most common forecasting techniques is ARIMA (AutoRegressive Integrated Moving Average).


ARIMA Components

AutoRegressive (AR)

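Predicts the current value from a linear combination of past values:

$$X_t = c + \sum_{i=1}^{p} \phi_i X_{t-i} + \varepsilon_t$$

Where:

  • $c$ = constant term

  • $\phi_i$ = coefficients

  • $p$ = AR order

  • $\varepsilon_t$ = white noise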
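
To illustrate the autoregressive idea, the sketch below simulates a first-order process $x_t = c + \phi x_{t-1} + \varepsilon_t$ and recovers the coefficient with ordinary least squares. The helper names, the true parameter values, and the Gaussian noise are assumptions for the example; a real workflow would normally rely on a statistics library.

```python
import random

def simulate_ar1(phi, c, n, sigma=1.0, seed=0):
    """Generate n points of x_t = c + phi * x_{t-1} + eps_t with Gaussian noise."""
    rng = random.Random(seed)
    x = [c / (1.0 - phi)]  # start at the process mean (requires |phi| < 1)
    for _ in range(n - 1):
        x.append(c + phi * x[-1] + rng.gauss(0.0, sigma))
    return x

def fit_ar1(x):
    """Least-squares regression of x_t on x_{t-1}; returns (c_hat, phi_hat)."""
    prev, curr = x[:-1], x[1:]
    mean_prev = sum(prev) / len(prev)
    mean_curr = sum(curr) / len(curr)
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi_hat = cov / var
    c_hat = mean_curr - phi_hat * mean_prev
    return c_hat, phi_hat

series = simulate_ar1(phi=0.7, c=1.0, n=5000)
c_hat, phi_hat = fit_ar1(series)
```

With a few thousand samples the estimate `phi_hat` should land close to the true coefficient 0.7, though the exact value depends on the random draw.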


Integrated (I)

Differences the series $d$ times to make it stationary. A first-order difference is:

$$X'_t = X_t - X_{t-1}$$
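
A short sketch of differencing, using a hypothetical helper `difference` and made-up series: one difference removes a linear trend, and two differences remove a quadratic one.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: x'_t = x_t - x_{t-1}."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out

linear = [2 * t + 5 for t in range(6)]    # 5, 7, 9, 11, 13, 15
quadratic = [t * t for t in range(6)]     # 0, 1, 4, 9, 16, 25

linear_diff = difference(linear, d=1)         # constant series of 2s
quadratic_diff = difference(quadratic, d=2)   # constant series of 2s
```

Each round of differencing shortens the series by one observation, which is why `linear_diff` has five values and `quadratic_diff` has four.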


Moving Average (MA)

Uses past forecasting errors to improve predictions:

$$X_t = \mu + \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}$$
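
The following sketch shows how a moving average model turns a sequence of error terms into observations. The helper `ma_series`, the coefficient 0.5, and the single unit shock are assumptions for the example; errors before the start of the series are treated as zero.

```python
def ma_series(errors, thetas, mu=0.0):
    """Build x_t = mu + eps_t + sum_i thetas[i] * eps_{t-1-i} from given errors."""
    q = len(thetas)
    out = []
    for t, eps in enumerate(errors):
        past = sum(thetas[i] * errors[t - 1 - i] for i in range(q) if t - 1 - i >= 0)
        out.append(mu + eps + past)
    return out

# A single unit shock at t = 0, passed through an MA(1) model with theta_1 = 0.5.
shock_response = ma_series([1.0, 0.0, 0.0], thetas=[0.5])
```

The response is 1.0, then 0.5, then 0.0: in an MA model of order q, a shock influences the series for exactly q steps after it occurs and then disappears.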


ARIMA Hyperparameters

  • $p$: Number of lag observations (AR order).

  • $d$: Number of differencing operations.

  • $q$: Order of the moving average model.

These parameters are tuned experimentally for better forecasting accuracy.