[RL basics] Week 2. Q-learning

Policy and Value-function • State-Value and State-Action • Bellman Equation • Monte Carlo & Temporal Difference

Policy and Value function

The main RL goal is to find an optimal policy. For this, we have two approaches:

Policy-based: directly learn the policy (which action to take in any given state).
Value-based: train a value function that assigns a value to each state, then derive the policy from those values (e.g., take the action with the maximum value).

Policy

Which action should be taken in the current state?

Our policy is a neural network: it takes a state as an input vector and outputs the action to take in that state.
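As a rough illustration of "state in, action out", here is a minimal sketch in which a single untrained linear layer stands in for the neural network. The sizes, the random weights, and the choice to output action probabilities and then sample from them are all assumptions made for this example, not something the notes specify.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes: a 4-dimensional state vector and 2 possible actions.
n_state_features, n_actions = 4, 2
# Untrained, random parameters standing in for a real network.
weights = rng.normal(size=(n_state_features, n_actions))

def policy(state):
    """Map a state vector to a probability distribution over actions."""
    logits = state @ weights
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

state = np.array([0.1, -0.2, 0.0, 0.3])       # hypothetical state
probs = policy(state)                         # one probability per action
action = int(rng.choice(n_actions, p=probs))  # sample an action index
print(probs, action)
```

A deterministic policy would take the argmax of the probabilities instead of sampling.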

Value

Which state has the highest value?

The value function is a neural network: it takes a state as an input vector and outputs the value of that state or of a state-action pair.
The action taken is the one with the maximum value.

State-Value and State-Action value

The State-Value function

The state-value function outputs the expected return if the agent starts in that state and then follows the policy π forever after (for all future time steps).

The Action-Value function

The action-value function outputs the expected return if the agent starts in that state, takes that action, and then follows the policy π forever after.
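The two definitions above can be written compactly in standard notation. The symbols are assumptions, since the notes do not define them: G_t is the discounted return from time step t, γ is the discount factor, and R_{t+k+1} is the reward received k+1 steps after t.

```latex
\begin{align*}
G_t &= \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \\
V_\pi(s) &= \mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right] \\
Q_\pi(s, a) &= \mathbb{E}_\pi\left[\, G_t \mid S_t = s,\ A_t = a \,\right]
\end{align*}
```

The only difference between the two is the extra conditioning on the first action A_t = a in the action-value function.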

The policy is then a simple function that you specify on top of the value function. A common example is the greedy policy, which in each state picks the action with the highest action value: π(s) = argmax_a Q(s, a).
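A greedy policy over a small table of action values might look like the sketch below. The table itself (3 states, 2 actions) and its numbers are made up for illustration; in practice the values would come from a trained value function.

```python
import numpy as np

# Hypothetical Q-table: rows are states, columns are actions.
q_table = np.array([
    [0.1, 0.5],
    [0.7, 0.2],
    [0.0, 0.3],
])

def greedy_policy(q_table, state):
    """Return the index of the action with the highest Q-value in `state`."""
    return int(np.argmax(q_table[state]))

actions = [greedy_policy(q_table, s) for s in range(q_table.shape[0])]
print(actions)  # [1, 0, 1]
```

Note that np.argmax returns the first index when several actions tie for the maximum, so ties are broken in favor of the lowest action index.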

Bellman Equation

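The Bellman equation is a recursive equation:

V_π(s) = E_π[ R_{t+1} + γ V_π(S_{t+1}) | S_t = s ]

Instead of computing the expected return as the sum of all future discounted rewards, we can compute it as the immediate reward R_{t+1} plus the discounted value γ V_π(S_{t+1}) of the next state S_{t+1}, where γ is the discount factor.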
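To see that the recursive form gives the same number as the full sum, here is a small check on a single deterministic trajectory, so the expectation drops out. The reward sequence and the discount factor gamma = 0.5 are made-up values for this example.

```python
def discounted_return(rewards, gamma):
    """Direct sum: r_1 + gamma * r_2 + gamma^2 * r_3 + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def values_by_recursion(rewards, gamma):
    """Backward recursion: V[t] = rewards[t] + gamma * V[t + 1], with V[T] = 0."""
    values = [0.0] * (len(rewards) + 1)
    for t in reversed(range(len(rewards))):
        values[t] = rewards[t] + gamma * values[t + 1]
    return values

rewards = [1.0, 0.0, 2.0]  # hypothetical rewards along one trajectory
gamma = 0.5                # assumed discount factor

direct = discounted_return(rewards, gamma)
recursive = values_by_recursion(rewards, gamma)
print(direct)     # 1.5
print(recursive)  # [1.5, 1.0, 2.0, 0.0]
```

The first entry of the recursive result matches the direct sum, and the recursion also yields the value of every later state along the trajectory in one backward pass.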