RL sandbox

First steps in reinforcement learning

Lunar Lander (simple RL example)

Lunar Lander is a good example for learning the general concept of RL, with its Observation, Action and Reward.

Observation

(X,Y) position
(X,Y) velocity
(angle, angular velocity)
(left leg contact, right leg contact)

Action

do nothing
fire left engine
fire main engine
fire right engine

Reward

move from top to landing pad +120pts
fire main engine -0.3pts / frame
each leg contact +10pts
lander crashes -100pts
lander comes to rest +100pts

Framework: Stable Baselines3

Model: Proximal Policy Optimization (PPO)

Policy: Multilayer Perceptron (MLP)
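A minimal sketch of how this setup could be wired together, assuming the gymnasium and stable_baselines3 packages (with Box2D support) are installed. The environment id "LunarLander-v3", the timestep budget and both function names are assumptions for illustration, not values taken from this project; older Gymnasium releases register the environment as "LunarLander-v2".

```python
def train_lunar_lander(env_id="LunarLander-v3", total_timesteps=200_000):
    """Train PPO with an MLP policy and return the trained model.

    env_id and total_timesteps are illustrative defaults (assumptions).
    """
    # Local imports: defining the function does not require the packages.
    import gymnasium as gym
    from stable_baselines3 import PPO

    env = gym.make(env_id)
    model = PPO("MlpPolicy", env, verbose=1)  # "MlpPolicy" = multilayer perceptron
    model.learn(total_timesteps=total_timesteps)
    env.close()
    return model


def run_episode(model, env_id="LunarLander-v3"):
    """Play one episode greedily and return the summed reward."""
    import gymnasium as gym

    env = gym.make(env_id)
    obs, _ = env.reset()
    total, done = 0.0, False
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(int(action))
        total += float(reward)
        done = terminated or truncated
    env.close()
    return total
```

Calling run_episode(train_lunar_lander()) would train and then score a single landing attempt; the training budget needed for a reliable landing is not something this sketch claims to know.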

Frozen Lake (Q-learning)

Frozen Lake is a good example for Q-learning because it is a tabular game: the states and actions are few and discrete, so the Q-values fit in a table.

Environment

R(rows) ✕ C(columns) matrix

Observation

agent's current position [i,j]

Actions

Rewards

reach gift (G): +1
reach hole (H): 0
reach frozen (F): 0

Personal enhancement:

add a reward of -1 for falling into a hole (H)

Slippery setup (True or False). If True, the agent moves in the intended direction with probability 1/3; otherwise it moves in one of the two perpendicular directions, each with probability 1/3.

As we can see, the non-slippery strategy is trivial, e.g. R-R-D-D-D-R, but we will see that it is a very bad strategy in the slippery setup.

The slippery setup is a good example of how random circumstances can dramatically change which (deterministic) strategy is best.

The trivial strategy ends up in a hole 94% of the time!
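The effect of slipping on the trivial path can be explored with a small simulation. This is only a sketch under stated assumptions: the map is taken to be the standard 4x4 Frozen Lake layout, the action order (0=left, 1=down, 2=right, 3=up) is assumed, and the plan R-R-D-D-D-R is replayed blindly as a fixed action sequence. The rates it prints depend on those choices and need not match the figure quoted above.

```python
import random

# Assumed map: standard 4x4 layout (S start, F frozen, H hole, G goal/gift).
LAKE = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
# Assumed action order: 0=left, 1=down, 2=right, 3=up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}
PLAN = [2, 2, 1, 1, 1, 2]  # R-R-D-D-D-R


def step(pos, action, slippery, rng):
    """Apply one action; walking into a wall leaves the agent in place."""
    if slippery:
        # Intended direction or either perpendicular one, 1/3 each.
        action = rng.choice([(action - 1) % 4, action, (action + 1) % 4])
    d_row, d_col = MOVES[action]
    row = min(max(pos[0] + d_row, 0), len(LAKE) - 1)
    col = min(max(pos[1] + d_col, 0), len(LAKE[0]) - 1)
    return (row, col)


def run_plan(plan, slippery, rng):
    """Replay a fixed action sequence and report how the episode ended."""
    pos = (0, 0)
    for action in plan:
        pos = step(pos, action, slippery, rng)
        tile = LAKE[pos[0]][pos[1]]
        if tile == "H":
            return "hole"
        if tile == "G":
            return "goal"
    return "unfinished"  # plan exhausted without reaching G or H


def outcome_rates(plan, slippery, episodes=10_000, seed=0):
    """Fraction of episodes ending in each outcome."""
    rng = random.Random(seed)
    counts = {"goal": 0, "hole": 0, "unfinished": 0}
    for _ in range(episodes):
        counts[run_plan(plan, slippery, rng)] += 1
    return {name: n / episodes for name, n in counts.items()}


if __name__ == "__main__":
    print("non-slippery:", outcome_rates(PLAN, slippery=False))
    print("slippery:    ", outcome_rates(PLAN, slippery=True))
```

Without slipping the plan reaches the goal every time. With slipping, each step has at most a 2/3 chance of going right or down at all, so a six-move plan can succeed in at most (2/3)^6, about 9%, of episodes.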

Example of learned Q table for Frozen Lake game

Final result: using Q-learning, we learned a strategy that reaches the goal in 74% of episodes, with a variance of only 0.44.

The idea of this project was to show that Q-learning is able to learn the 'tricky' slippery strategy.
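For reference, the core of tabular Q-learning is a single update rule, sketched below with an epsilon-greedy action choice. The function names and the values of alpha, gamma and epsilon are assumptions chosen for illustration; they are not the hyperparameters used in this project.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties between equally good actions at random.
    return rng.choice([a for a, v in enumerate(q_row) if v == best])


def q_update(q, state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    """Move Q(s, a) toward the target r + gamma * max_a' Q(s', a').

    On a terminal transition there is no next state, so the target is r.
    """
    target = reward if done else reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


# Usage on a 16-state, 4-action table (the size of a 4x4 lake).
q_table = [[0.0] * 4 for _ in range(16)]
# Stepping right from state 14 reaches the gift in state 15 (reward +1).
q_update(q_table, 14, 2, 1.0, 15, done=True, alpha=0.5)
# The value then propagates one state back through the max over actions.
q_update(q_table, 13, 2, 0.0, 14, done=False, alpha=0.5, gamma=0.9)
```

Because the update averages over whatever transitions actually occur, the slipping is absorbed into the Q-values without the agent needing a model of it.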

Taxi (Q-learning)

Taxi is a good example of sparse reward.

The agent has to figure out how to find the passenger's location, then when to use the 'pickup' action correctly, then where to go with the passenger, and finally when to use the 'drop off' action properly.

Note that illegal 'pickup' and 'drop off' actions result in a negative reward, so the agent might learn never to use them at all.

Description:

Four special locations, Red (0), Green (1), Yellow (2) and Blue (3), out of 25 squares in total.
The taxi starts at a random square, the passenger at a random location.
The taxi drives to the passenger, picks him up, drives to the destination and drops the passenger off. Once the passenger is dropped off, the episode ends.

Environment

R(rows) ✕ C(columns) matrix

Observation (25 ✕ 5 ✕ 4 = 500 states)

taxi position [i,j] (25)
passenger location (0,1,2,3,4), where 4 means the passenger is in the taxi
passenger destination (0,1,2,3)
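The three components above can be packed into one integer in the range 0..499, which is convenient as a row index into a Q-table. The sketch below shows one possible packing; the function names and the ordering (taxi row, taxi column, passenger location, destination) are assumptions for illustration.

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack (row, col, passenger, destination) into a single state index."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination


def decode(state):
    """Inverse of encode: recover the four components from the index."""
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination


# Taxi at row 2, column 3; passenger waiting at Yellow (2); destination Red (0).
index = encode(2, 3, 2, 0)
assert decode(index) == (2, 3, 2, 0)
```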

Actions

Rewards

-1 per step unless another reward is triggered.
+20 for delivering the passenger.
-10 for executing “pickup” or “drop-off” actions illegally.
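To see why avoiding 'pickup' and 'drop off' altogether does not pay off, it helps to add up the rewards of a few hypothetical episodes. The event lists below are made up for illustration, and the 200-step cap on an episode is an assumption.

```python
# Reward per event, as listed above.
REWARDS = {"step": -1, "delivered": 20, "illegal": -10}


def episode_return(events):
    """Undiscounted sum of rewards over one episode."""
    return sum(REWARDS[event] for event in events)


# 10 ordinary steps (moves plus a legal pickup), then a successful drop-off.
clean = ["step"] * 10 + ["delivered"]
# Same route, but with two illegal pickup/drop-off attempts along the way.
sloppy = ["step"] * 10 + ["illegal"] * 2 + ["delivered"]
# An agent that never dares to pick up, until an assumed 200-step cap.
never_tries = ["step"] * 200

print(episode_return(clean))        # 10
print(episode_return(sloppy))       # -10
print(episode_return(never_tries))  # -200
```

Even a delivery with a couple of illegal attempts scores far better than wandering until the episode is cut off, but the agent only discovers this after it has stumbled on a full delivery at least once.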

Pong
