Markov Decision Process in Reinforcement Learning

Reinforcement learning is a machine learning technique that enables agents to learn from their environment through trial-and-error interactions. One of the key components of reinforcement learning is the Markov Decision Process (MDP), which provides a framework for modeling sequential decision-making problems. In this article, we will explore the basics of the Markov Decision Process in reinforcement learning.

1. What is a Markov Decision Process?

A Markov Decision Process is a mathematical framework that models a sequential decision-making problem in which an agent interacts with an environment. The environment is modeled as a set of states, and the agent’s actions can transition it from one state to another. The agent’s goal is to maximize some notion of cumulative reward over time.

Formally, a Markov Decision Process is defined as a tuple (S, A, P, R, γ), where:

S is a set of states. Each state s ∈ S represents a particular configuration of the environment.

A is a set of actions. Each action a ∈ A is a possible decision that the agent can take in a given state.

P is a state transition function that specifies the probability of transitioning from one state to another, given an action. That is, P(s’ | s, a) is the probability of transitioning to state s’ when taking action a in state s.

R is a reward function that assigns a scalar value to each state-action pair. The reward function represents the immediate feedback that the agent receives from the environment. The goal of the agent is to maximize the cumulative reward over time.

γ is a discount factor that determines the importance of future rewards. It is a scalar value between 0 and 1 that discounts the value of future rewards relative to immediate rewards.
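
To make these components concrete, here is a minimal sketch of a hypothetical two-state MDP written out as plain Python data structures. The state names, actions, and numbers are invented purely for illustration; they are not part of any library API.

# A toy two-state MDP: states, actions, transition probabilities P(s' | s, a),
# rewards R(s, a), and a discount factor gamma. All values are made up.
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[s][a] maps each possible next state s_ to its probability P(s_ | s, a)
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.9, "s1": 0.1}},
}

# R[s][a] is the immediate reward for taking action a in state s
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}

gamma = 0.9  # discount factor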

2. The Markov Property

One of the key assumptions of a Markov Decision Process is the Markov property, which states that the next state and reward depend only on the current state and action, not on the past history of the agent’s interactions with the environment. In other words, the Markov property assumes that the present state contains all the information about the environment needed to make optimal decisions.
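
In symbols, the Markov property can be written as:

P(s_{t+1}, r_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0) = P(s_{t+1}, r_{t+1} | s_t, a_t)

so the conditional distribution of the next state and reward given the entire history equals the distribution given only the current state and action.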

3. The Bellman equation

The Bellman equation is a fundamental equation in reinforcement learning that expresses the optimal value of a state as the expected value of the immediate reward plus the discounted value of the future rewards. Mathematically, the Bellman equation is defined as:

V(s) = max_a{R(s,a) + γ∑_{s’} P(s’ | s,a) V(s’)}

where V(s) is the value function that represents the expected cumulative reward from state s onward, and max_a{} denotes the maximum over all possible actions.
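
As a sketch of how this backup looks in code, the following snippet reuses the toy two-state MDP defined earlier (an illustrative example, separate from the FrozenLake code later in this article) to apply the Bellman equation once to a single state:

def bellman_backup(s, V, P, R, gamma):
    # One application of the Bellman equation for state s:
    # V(s) = max_a { R(s,a) + gamma * sum_{s'} P(s'|s,a) * V(s') }
    return max(
        R[s][a] + gamma * sum(p * V[s_] for s_, p in P[s][a].items())
        for a in P[s]
    )

V = {"s0": 0.0, "s1": 0.0}                      # initial value estimates
V["s0"] = bellman_backup("s0", V, P, R, gamma)
print(V["s0"])                                  # 1.0 on the first sweep (the "move" reward)

Repeating this backup for every state until the values stop changing is exactly the value iteration algorithm used in Section 5.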

4. The optimal policy

The optimal policy is a function that maps each state to the best action to take in that state to maximize the cumulative reward over time. The optimal policy is defined as:

π^*(s) = argmax_a{R(s,a) + γ∑_{s’} P(s’ | s,a) V(s’)}

where argmax_a{} denotes the action that maximizes the expression inside the curly braces.
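
Continuing the toy sketch, once a value function V is available the greedy policy can be read off by taking the argmax over actions in each state:

def greedy_action(s, V, P, R, gamma):
    # pi*(s) = argmax_a { R(s,a) + gamma * sum_{s'} P(s'|s,a) * V(s') }
    return max(
        P[s],
        key=lambda a: R[s][a] + gamma * sum(p * V[s_] for s_, p in P[s][a].items()),
    )

policy = {s: greedy_action(s, V, P, R, gamma) for s in states}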

5. Solving a Markov Decision Process

Solving a Markov Decision Process (MDP) involves finding an optimal policy that maximizes the expected cumulative reward over time. There are several algorithms to solve an MDP, including value iteration, policy iteration, and Q-learning. In this article, we will focus on value iteration and policy iteration, which are two popular algorithms for solving MDPs.

5.1 Step 1: Define the MDP

The first step in solving an MDP is to define the environment. In the OpenAI Gym library, environments are defined as classes that have methods for taking actions, getting observations, and computing rewards. For example, the FrozenLake environment is a grid world game where the agent must navigate to the goal without falling into a hole:

import gym
env = gym.make('FrozenLake-v0')  # newer Gym/Gymnasium releases register this map as 'FrozenLake-v1'
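
A quick way to inspect the environment is to print its state and action counts and the transition table that the algorithms below rely on (this assumes the classic Gym API used throughout this article; newer Gym/Gymnasium releases change the step and reset signatures):

print(env.observation_space.n)   # 16 states in the default 4x4 grid
print(env.action_space.n)        # 4 actions: left, down, right, up
# env.P[s][a] is a list of (probability, next_state, reward, done) tuples
print(env.P[0][0])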

5.2 Step 2: Initialize the Value Function

The value function represents the expected cumulative reward for each state in the MDP. It can be initialized to zero for all states:

import numpy as np

V = np.zeros(env.observation_space.n)

5.3 Step 3: Perform Value Iteration

Value iteration repeatedly applies the Bellman optimality update to every state until the value function converges:

V(s) = max_a ∑_{s’} P(s’ | s, a) [R(s, a, s’) + γ V(s’)]

where V(s) is the value function for state s, a is an action, P(s’ | s, a) is the probability of transitioning from state s to state s’ when taking action a, R(s, a, s’) is the reward received for that transition, and γ is the discount factor.

import numpy as np

def value_iteration(env, gamma=0.9, theta=1e-5):
    # Initialize the value function to zero for all states
    V = np.zeros(env.observation_space.n)

    while True:
        delta = 0
        for s in range(env.observation_space.n):
            v = V[s]
            # Bellman optimality backup: best expected return over all actions
            V[s] = max([sum([p * (r + gamma * V[s_]) for p, s_, r, _ in env.P[s][a]])
                        for a in range(env.action_space.n)])
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break

    # Extract the greedy policy with respect to the converged value function
    policy = np.zeros(env.observation_space.n, dtype=int)
    for s in range(env.observation_space.n):
        policy[s] = np.argmax([sum([p * (r + gamma * V[s_]) for p, s_, r, _ in env.P[s][a]])
                               for a in range(env.action_space.n)])

    return policy, V

The value_iteration function takes an OpenAI Gym environment as input and returns the optimal policy and value function. The gamma parameter is the discount factor, and the theta parameter is the stopping criterion.

The function initializes the value function to zero for all states and iteratively updates it until convergence. The function then computes the optimal policy by selecting the action that maximizes the expected cumulative reward.
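
Putting it together, a typical call might look like this (the reshape assumes the default 4x4 FrozenLake map):

policy, V = value_iteration(env, gamma=0.9, theta=1e-5)
print(policy.reshape(4, 4))   # one action index per cell of the lake
print(V.reshape(4, 4))        # state values arranged on the grid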

5.4 Step 4: Perform Policy Iteration

Policy iteration is an algorithm that alternates between policy evaluation and policy improvement steps until the policy is unchanged.
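
The policy evaluation step computes the value of the current, fixed policy π with the Bellman expectation equation, which is the same backup as before but without the max over actions, since the action in each state is dictated by π:

V_π(s) = ∑_{s’} P(s’ | s, π(s)) [R(s, π(s), s’) + γ V_π(s’)]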

import numpy as np

def policy_iteration(env, gamma=0.9, theta=1e-5):
    # Initialize the policy to action 0 in every state
    policy = np.zeros(env.observation_space.n, dtype=int)

    while True:
        # Policy evaluation step: compute V for the current policy
        V = np.zeros(env.observation_space.n)
        while True:
            delta = 0
            for s in range(env.observation_space.n):
                v = V[s]
                # Bellman expectation backup for the action chosen by the policy
                V[s] = sum([p * (r + gamma * V[s_]) for p, s_, r, _ in env.P[s][policy[s]]])
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break

        # Policy improvement step: act greedily with respect to V
        policy_stable = True
        for s in range(env.observation_space.n):
            old_action = policy[s]
            policy[s] = np.argmax([sum([p * (r + gamma * V[s_]) for p, s_, r, _ in env.P[s][a]])
                                   for a in range(env.action_space.n)])
            if old_action != policy[s]:
                policy_stable = False
        if policy_stable:
            break

    return policy, V

The policy_iteration function takes an OpenAI Gym environment as input and returns the optimal policy and value function. The gamma parameter is the discount factor, and the theta parameter is the stopping criterion.

The function initializes the policy to a default action (action 0 in every state) and alternates between policy evaluation and policy improvement steps until the policy no longer changes. Policy evaluation computes the value function of the current policy using the Bellman expectation equation, and policy improvement then selects, in each state, the action that maximizes the expected cumulative reward.
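
A quick sanity check (with hypothetical variable names) is that policy iteration and value iteration should arrive at essentially the same value function, even if they break ties between equally good actions differently:

policy_pi, V_pi = policy_iteration(env, gamma=0.9, theta=1e-5)
policy_vi, V_vi = value_iteration(env, gamma=0.9, theta=1e-5)
print(np.allclose(V_pi, V_vi, atol=1e-3))   # expected: True, both approximate the optimal values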

5.5 Step 5: Test the Policy

Once we have computed the optimal policy and value function, we can test the policy by running the agent in the environment and observing its behavior. Here is the Python code to test the policy:

def test_policy(policy, env, episodes=100):
    rewards = []
    for episode in range(episodes):
        obs = env.reset()
        done = False
        total_reward = 0
        while not done:
            action = policy[obs]
            obs, reward, done, info = env.step(action)
            total_reward += reward
        rewards.append(total_reward)
    return rewards

The test_policy function takes the optimal policy, the environment, and the number of episodes to run the policy as input and returns a list of the total rewards for each episode. The function runs the policy in the environment, collects the rewards, and returns the results.
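
For FrozenLake, where the only nonzero reward is the 1 received for reaching the goal, the average total reward is simply the policy's success rate:

rewards = test_policy(policy, env, episodes=1000)
print(sum(rewards) / len(rewards))   # fraction of episodes that reached the goal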

6. Conclusion

In conclusion, the Markov Decision Process (MDP) is a powerful mathematical framework for modeling decision-making problems in a wide range of applications, particularly in reinforcement learning. An MDP defines a set of states, actions, transition probabilities, and rewards, and a decision-making agent seeks to maximize the expected cumulative reward over time by selecting the optimal sequence of actions.

