Markov Decision Process (MDP)
Date: 06 Oct 2025
Introduction
A Markov Decision Process (MDP) is a mathematical framework used in reinforcement learning, machine learning, and artificial intelligence for modeling sequential decision-making problems where outcomes are partly random and partly controlled by an agent.
MDPs are widely used in robotics, autonomous systems, gameplay AI, healthcare planning, recommendation systems, and self-driving vehicles.
Key Terms in MDP
1. State (S)
A state represents the current situation of the agent within its environment.
Example:
In a chess game, the arrangement of pieces on the board is a state.
In robotics, the robot’s current position is a state.
2. Action (A)
An action is a possible move the agent can take.
Each state can have one or more actions available.
Example:
Move left
Move right
Pick object
Stop
3. Transition Function (T)
The transition function defines how the environment changes after an action.
It represents the probability of moving from one state to another after taking an action.
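Formally, T is defined on the Cartesian product of states, actions, and next states:
$T : S \times A \times S \to [0, 1]$, with $T(s, a, s') = P(S_{t+1} = s' \mid S_t = s, A_t = a)$
where:
$s$ = current state
$a$ = action taken
$s'$ = next state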
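As a minimal sketch of how a transition function can be stored and used, the snippet below keeps T as a nested dictionary and samples a next state from it. The two-state MDP, the state and action names, and both helper functions are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical two-state MDP: T[state][action] maps next_state -> probability.
T = {
    "s0": {"left": {"s0": 1.0}, "right": {"s0": 0.2, "s1": 0.8}},
    "s1": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s1": 1.0}},
}

def check_transition_function(T, tol=1e-9):
    """Return True if every T[s][a] is a valid probability distribution."""
    for actions in T.values():
        for dist in actions.values():
            if any(p < 0 for p in dist.values()):
                return False
            if abs(sum(dist.values()) - 1.0) > tol:
                return False
    return True

def sample_next_state(T, state, action, rng=random):
    """Draw a next state s' with probability T(s, a, s')."""
    next_states = list(T[state][action])
    weights = [T[state][action][s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]

print(check_transition_function(T))
print(sample_next_state(T, "s0", "right"))  # "s1" with probability 0.8
```

For each state and action, the probabilities over next states must sum to 1, which is what the check verifies.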
4. Reward (R)
A reward is a numerical feedback signal received after taking an action.
Positive reward → desirable outcome
Negative reward → undesirable outcome
Example:
Winning a game → +100 reward
Crashing a robot → −50 reward
5. Policy (π)
A policy defines the behavior of the agent.
The policy tells the agent what action to take in each state.
Return
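The return $G_t$ is the total discounted future reward accumulated by the agent from time step $t$ onward:
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
Where:
$\gamma \in [0, 1]$ = discount factor
Interpretation:
If $\gamma$ is close to 0 → the agent focuses on immediate rewards.
If $\gamma$ is close to 1 → the agent focuses on long-term rewards.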
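The effect of the discount factor can be checked on a short reward sequence. The sketch below computes the discounted return of a finite episode; the function name and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] for a finite episode, computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 0.0, 10.0]  # illustrative rewards R_{t+1}, ..., R_{t+4}

print(discounted_return(rewards, 0.0))  # only the immediate reward counts
print(discounted_return(rewards, 0.9))  # roughly 1 + 0.9**3 * 10
print(discounted_return(rewards, 1.0))  # plain sum of all rewards
```

With a discount factor of 0 the late reward of 10 is ignored entirely, while with a factor near 1 it dominates the return.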
Value Function
A value function estimates how beneficial a state or action is in the long run.
State Value Function (MRP)
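$v(s) = \mathbb{E}[G_t \mid S_t = s]$
It represents the expected return $G_t$ starting from state $s$ in a Markov Reward Process (MRP), that is, a Markov process with rewards but no actions to choose.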
Bellman Expectation Equation
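$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$
The Bellman equation breaks the value into:
Immediate reward $R_{t+1}$
Discounted future value $\gamma v(S_{t+1})$, where $\gamma$ is the discount factor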
This recursive structure is the foundation of reinforcement learning algorithms.
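Because the value appears on both sides of the Bellman equation, it can be found by applying the equation repeatedly until the values stop changing. The following is a minimal sketch for a small Markov Reward Process. The two states, their transition probabilities, and the convention that R[s] is the expected immediate reward on leaving state s are assumptions made for this example.

```python
def evaluate_mrp(P, R, gamma, sweeps=1000, tol=1e-10):
    """Iterate v(s) <- R[s] + gamma * sum_s' P[s][s'] * v(s') until convergence."""
    v = {s: 0.0 for s in P}
    for _ in range(sweeps):
        delta = 0.0
        for s in P:
            new_v = R[s] + gamma * sum(p * v[s2] for s2, p in P[s].items())
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            break
    return v

# Hypothetical MRP: from A we stay in A or move to B with equal probability;
# B is absorbing and yields no reward.
P = {"A": {"A": 0.5, "B": 0.5}, "B": {"B": 1.0}}
R = {"A": 1.0, "B": 0.0}

v = evaluate_mrp(P, R, gamma=0.9)
print(v)  # v["A"] approaches 1 / (1 - 0.9 * 0.5), v["B"] stays at 0
```

For state A the fixed point can be confirmed by hand: v(A) = 1 + 0.9 * 0.5 * v(A), so v(A) = 1 / 0.55.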
State Value Function in MDP
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
This calculates the expected return when starting in state $s$ and following policy $\pi$.
Action Value Function
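$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
The action value function estimates the quality of taking action $a$ in state $s$ and following policy $\pi$ afterwards.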
Optimal Value Functions
Optimal value functions help determine the best possible policy.
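Optimal State Value Function
$v_*(s) = \max_\pi v_\pi(s)$
This is the largest state value achievable by any policy.
Optimal Action Value Function
$q_*(s, a) = \max_\pi q_\pi(s, a)$
This is the largest action value achievable by any policy.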
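One standard way to compute optimal values for a small, fully known MDP is value iteration. The notes above do not name it, so it is introduced here only as an illustration. The sketch repeatedly applies the update $v(s) \leftarrow \max_a [R(s, a) + \gamma \sum_{s'} T(s, a, s') v(s')]$ and then reads off a greedy policy. The two-state MDP, the reward table R[s][a], and the function name are hypothetical.

```python
def value_iteration(T, R, gamma, tol=1e-10, max_sweeps=10_000):
    """Return (optimal state values, greedy policy) for a tabular MDP."""
    v = {s: 0.0 for s in T}

    def q(s, a):
        # One-step lookahead: expected reward plus discounted next-state value.
        return R[s][a] + gamma * sum(p * v[s2] for s2, p in T[s][a].items())

    for _ in range(max_sweeps):
        delta = 0.0
        for s in T:
            best = max(q(s, a) for a in T[s])
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    policy = {s: max(T[s], key=lambda a: q(s, a)) for s in T}
    return v, policy

# Hypothetical MDP: staying in s1 pays 1 per step, everything else pays 0.
T = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}
R = {"s0": {"stay": 0.0, "go": 0.0}, "s1": {"stay": 1.0, "go": 0.0}}

v_star, pi_star = value_iteration(T, R, gamma=0.9)
print(v_star, pi_star)
```

In this example the best behavior is to move to s1 and stay there, so the value of s1 approaches 1 / (1 - 0.9) = 10 and the value of s0 approaches 0.9 * 10 = 9.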
Markov Property (Memoryless Property)
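An MDP satisfies the Markov property, meaning the future depends only on the current state and action, not on past history:
$P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_1, A_1, \dots, S_t, A_t)$
This property simplifies complex decision-making systems, because the agent can decide from the current state alone without storing the full history.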
Policy Gradient Method
Policy gradient methods are reinforcement learning techniques that directly optimize a parameterized policy.
To place them in context, reinforcement learning methods are mainly divided into two categories:
1. Value-Based Methods
Value-based methods first learn the value function and then derive a policy from it.
Examples:
Q-Learning
Deep Q-Networks (DQN)
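As an illustration of a value-based method, the sketch below shows the tabular Q-learning update $Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$. The Q-table contents, the state and action names, and the learning rate $\alpha$ are made-up values for this example.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma, done=False):
    """Apply one Q-learning update in place and return the new Q[s][a]."""
    # Bootstrapped target: no future value is added after a terminal step.
    target = r if done else r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]

# Hypothetical Q-table with two states and two actions.
Q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 2.0},
}

# Observed transition: in s0, action "right" gave reward 1 and led to s1.
q_learning_update(Q, "s0", "right", r=1.0, s_next="s1", alpha=0.5, gamma=0.9)
print(Q["s0"]["right"])  # moves halfway from 0 toward the target 1 + 0.9 * 2
```

A policy is then derived from the learned table by picking, in each state, the action with the largest Q-value.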
2. Policy-Based Methods
Policy-based methods directly learn the policy.
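Stochastic policy:
$\pi_\theta(a \mid s) = P(A_t = a \mid S_t = s; \theta)$
These methods optimize the expected reward objective function $J(\theta)$, typically by gradient ascent on the policy parameters $\theta$.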
Advantages:
Better for continuous action spaces
More stable in some environments
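A minimal sketch of a policy-based update is given below. It assumes a softmax policy over per-action preferences (one parameter per action in a single state) and uses a REINFORCE-style step, in which the parameters move along return times the gradient of the log-probability of the chosen action. Neither the softmax parameterization nor REINFORCE is specified in the text above; both are chosen here for illustration, and all names are hypothetical.

```python
import math

def softmax_policy(prefs):
    """Turn action preferences into a stochastic policy pi(a | s)."""
    m = max(prefs.values())  # subtract the max for numerical stability
    exps = {a: math.exp(h - m) for a, h in prefs.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def grad_log_pi(prefs, action):
    """Gradient of log pi(action) w.r.t. each preference: 1[b == action] - pi(b)."""
    pi = softmax_policy(prefs)
    return {b: (1.0 if b == action else 0.0) - pi[b] for b in prefs}

def reinforce_step(prefs, action, G, lr):
    """One gradient-ascent step: prefs += lr * G * grad log pi(action)."""
    grad = grad_log_pi(prefs, action)
    return {b: prefs[b] + lr * G * grad[b] for b in prefs}

prefs = {"left": 0.0, "right": 0.0}
print(softmax_policy(prefs))  # uniform policy before any update

# Suppose "right" was sampled and the episode return was G = 1.0.
prefs = reinforce_step(prefs, "right", G=1.0, lr=0.1)
print(softmax_policy(prefs))  # "right" is now slightly more likely
```

Actions followed by a positive return become more probable, and the same update works unchanged for any number of actions.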
Actor-Critic Algorithm
The Actor-Critic algorithm combines value-based and policy-based reinforcement learning approaches.
Components
Actor
Chooses actions
Learns the policy
Critic
Evaluates actions
Computes value functions and TD error
Steps of Actor-Critic Algorithm
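Sample a transition $(s, a, r, s')$ using policy $\pi_\theta$
Compute the advantage function, for example via the TD error $\delta = r + \gamma V(s') - V(s)$
Compute the policy gradient $\delta \, \nabla_\theta \log \pi_\theta(a \mid s)$
Update the policy (actor) parameters $\theta$
Update the critic weights
Repeat until the policy converges.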
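The steps above can be sketched as a one-step tabular actor-critic. This is an illustrative sketch under several assumptions: the actor is a softmax over per-state action preferences `theta`, the critic is a value table `V`, the TD error serves as the advantage estimate, and the two-state environment in `step` is hypothetical.

```python
import math
import random

def softmax(prefs):
    m = max(prefs.values())
    exps = {a: math.exp(h - m) for a, h in prefs.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def actor_critic_step(theta, V, s, a, r, s_next, done, gamma, lr_actor, lr_critic):
    """Update actor and critic in place from one transition; return the TD error."""
    target = r if done else r + gamma * V[s_next]
    td_error = target - V[s]              # advantage estimate
    pi = softmax(theta[s])
    for b in theta[s]:                    # policy gradient for a softmax actor
        grad_log = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += lr_actor * td_error * grad_log
    V[s] += lr_critic * td_error          # critic update
    return td_error

def step(s, a):
    """Hypothetical chain: 'go' from s0 reaches s1; 'go' from s1 ends with reward 1."""
    if a == "stop":
        return 0.0, s, True
    if s == "s0":
        return 0.0, "s1", False
    return 1.0, s, True

rng = random.Random(0)
theta = {s: {"go": 0.0, "stop": 0.0} for s in ("s0", "s1")}
V = {"s0": 0.0, "s1": 0.0}

for _ in range(2000):                     # repeat over many episodes
    s, done = "s0", False
    while not done:
        pi = softmax(theta[s])            # sample an action from the actor
        a = rng.choices(list(pi), weights=list(pi.values()), k=1)[0]
        r, s_next, done = step(s, a)
        actor_critic_step(theta, V, s, a, r, s_next, done,
                          gamma=0.9, lr_actor=0.1, lr_critic=0.1)
        s = s_next

print(softmax(theta["s0"]), softmax(theta["s1"]), V)
```

Using the TD error in place of the full return lets the actor learn after every step rather than only at the end of an episode.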
Applications of MDP
Robotics
Robots use MDPs for:
Navigation
Path planning
Obstacle avoidance
Gameplay AI
Used in:
Chess engines
Video game agents
Strategy optimization
Healthcare
Doctors and AI systems use MDPs for treatment planning under uncertain outcomes.
Traffic and Navigation
Applications include:
Self-driving cars
GPS route optimization
Delivery systems
Inventory Management
Businesses use MDPs to optimize stock levels and reduce losses.
Time Series Analysis
A time series is a sequence of observations collected over time.
Examples:
Stock prices
Temperature readings
Sales reports
Components of Time Series
1. Trend
Long-term increase or decrease in data.
2. Seasonality
Patterns repeating at regular intervals.
3. Cyclicality
Long-term fluctuations without fixed periodicity.
4. Noise
Random and unpredictable variations.
Time Series Forecasting
Time series forecasting predicts future values based on historical observations.
One of the most common forecasting techniques is ARIMA (AutoRegressive Integrated Moving Average).
ARIMA Components
AutoRegressive (AR)
$X_t = \sum_{i=1}^{p} \phi_i X_{t-i} + \varepsilon_t$
Where:
$\phi_i$ = coefficients
$p$ = AR order
$\varepsilon_t$ = white noise
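To make the AR model concrete, the sketch below simulates an AR(1) series and recovers its coefficient by ordinary least squares, regressing each value on the previous one. It is restricted to order p = 1 for brevity and adds an optional constant term `c`, which is an assumption of this example. All function names and parameter values are hypothetical.

```python
import random

def simulate_ar1(c, phi, sigma, n, seed=0):
    """Generate X_t = c + phi * X_{t-1} + eps_t with Gaussian white noise."""
    rng = random.Random(seed)
    x = [c / (1.0 - phi)]  # start at the stationary mean (requires |phi| < 1)
    for _ in range(n - 1):
        x.append(c + phi * x[-1] + rng.gauss(0.0, sigma))
    return x

def fit_ar1(x):
    """Least-squares regression of X_t on X_{t-1}; returns (c_hat, phi_hat)."""
    prev, curr = x[:-1], x[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi_hat = cov / var
    return mean_curr - phi_hat * mean_prev, phi_hat

series = simulate_ar1(c=0.0, phi=0.6, sigma=1.0, n=5000, seed=0)
c_hat, phi_hat = fit_ar1(series)
print(c_hat, phi_hat)  # phi_hat should land near the true value 0.6
```

Higher orders follow the same idea with more lagged values as regressors.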
Integrated (I)
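Differencing is used to make the series stationary. First-order differencing replaces each value with its change from the previous value: $X'_t = X_t - X_{t-1}$.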
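A short sketch of differencing shows how it removes a trend. The function name and the example series are made up for illustration.

```python
def difference(x, d=1):
    """Apply first-order differencing d times: x'_t = x_t - x_{t-1}."""
    for _ in range(d):
        x = [b - a for a, b in zip(x, x[1:])]
    return x

linear_trend = [2 * t + 5 for t in range(6)]
print(difference(linear_trend, d=1))     # [2, 2, 2, 2, 2]

quadratic_trend = [t * t for t in range(6)]
print(difference(quadratic_trend, d=2))  # [2, 2, 2, 2]
```

One round of differencing flattens a linear trend, and two rounds flatten a quadratic one. Each round shortens the series by one observation.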
Moving Average (MA)
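Uses past forecasting errors to improve predictions:
$X_t = \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}$
where $\varepsilon_t$ is white noise, $\theta_i$ are the MA coefficients, and $q$ is the MA order.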
ARIMA Hyperparameters
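$p$: number of lag observations (AR order).
$d$: number of differencing operations.
$q$: order of the moving average model.
These parameters are usually tuned experimentally for better forecasting accuracy, for example by comparing candidate $(p, d, q)$ settings on held-out data or with an information criterion such as AIC.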
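For reference, the three components can be written as a single ARIMA model of orders p, d, and q. The form below uses the lag operator $L$, defined by $L X_t = X_{t-1}$, which is notation introduced here and not used elsewhere in these notes. It omits a constant term.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i L^i\right) (1 - L)^d X_t
  = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t
```

The factor $(1 - L)^d$ is the differencing step applied $d$ times. The left bracket is the AR part and the right bracket is the MA part.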
