RL sandbox
First steps in Reinforcement learning
Lunar Lander is a good example for learning the general concepts of RL: Observation, Action and Reward.
Observation:
(X, Y) position
(X, Y) velocity
(angle, angular velocity)
(left leg contact, right leg contact)
Action:
do nothing
fire left engine
fire main engine
fire right engine
Reward:
move from top to landing pad: +120 pts
fire main engine: -0.3 pts / frame
each leg contact: +10 pts
lander crashes: -100 pts
lander comes to rest: +100 pts
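The observation and action lists above can be written down as small Python types. This is only an illustrative sketch: the names `Action`, `Observation` and `parse_observation` are hypothetical, and the action ids simply assume the order in which the actions are listed.

```python
from enum import IntEnum
from typing import NamedTuple, Sequence


class Action(IntEnum):
    """Discrete actions, numbered in the order listed above (assumed ids)."""
    DO_NOTHING = 0
    FIRE_LEFT = 1
    FIRE_MAIN = 2
    FIRE_RIGHT = 3


class Observation(NamedTuple):
    """The eight observed values, grouped as in the list above."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def parse_observation(raw: Sequence[float]) -> Observation:
    """Turn a flat 8-value observation vector into a named record."""
    if len(raw) != 8:
        raise ValueError(f"expected 8 values, got {len(raw)}")
    x, y, vx, vy, angle, angular_velocity, left, right = raw
    return Observation(x, y, vx, vy, angle, angular_velocity, bool(left), bool(right))
```

Naming the fields makes it easier to inspect what the agent sees at each frame, for example `parse_observation(obs).left_leg_contact`.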
Framework: Stable-Baselines3
Model: Proximal Policy Optimization (PPO)
Policy: Multilayer Perceptron (MLP)
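A minimal sketch of how this setup could be trained, assuming Stable-Baselines3 and Gymnasium (with the Box2D extra) are installed. The environment id, the step budget and the helper name `train_lander` are assumptions, not the settings actually used in this project.

```python
# Hypothetical settings; the environment id and step budget are assumptions.
CONFIG = {
    "env_id": "LunarLander-v2",  # newer Gymnasium releases register "LunarLander-v3"
    "policy": "MlpPolicy",       # Stable-Baselines3 name for the MLP policy
    "total_timesteps": 200_000,
}


def train_lander(config=CONFIG):
    """Train PPO with an MLP policy; return (model, mean_reward, std_reward)."""
    # Imported lazily so the snippet can be loaded without the RL stack installed.
    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy

    env = gym.make(config["env_id"])
    model = PPO(config["policy"], env, verbose=1)
    model.learn(total_timesteps=config["total_timesteps"])

    # Average the episode return over a few evaluation episodes.
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
    env.close()
    return model, mean_reward, std_reward
```

Calling `train_lander()` starts training; a longer step budget generally gives a steadier landing policy, at the cost of training time.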
Frozen Lake is a good example for Q-learning, as it is a tabular game.
Observation:
R (rows) × C (columns) matrix
agent current position [i, j]
Reward:
reach gift (G): +1
reach hole (H): 0
reach frozen (F): 0
Personal enhancement:
add a -1 reward for reaching a hole (H)
Slippery setup (True or False). If True, the agent moves in the intended direction with probability 1/3; otherwise it slips into one of the two perpendicular directions, each with probability 1/3.
As we can see, the non-slippery strategy is trivial, e.g. R-R-D-D-D-R, but it turns out to be a very bad strategy in the slippery setup.
The slippery setup is a good example of how random circumstances can dramatically change the best (deterministic) strategy.
The trivial strategy ends up in a hole 94% of the time!
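The effect of slipping can be explored with a small simulation. The sketch below assumes the standard 4x4 Frozen Lake layout (the path R-R-D-D-D-R fits it, but the map is not stated here) and replays a fixed action sequence; all function names are hypothetical. It is not meant to reproduce the exact figure above, which depends on how the strategy is replayed after a slip.

```python
import random

# Assumed layout: the standard 4x4 map (S=start, F=frozen, H=hole, G=gift).
LAKE = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]

MOVES = {"L": (0, -1), "D": (1, 0), "R": (0, 1), "U": (-1, 0)}
PERPENDICULAR = {"L": "UD", "R": "UD", "U": "LR", "D": "LR"}


def step(pos, action, slippery, rng):
    """Move one cell. On slippery ice the actual direction is drawn uniformly
    from the intended one and the two perpendicular ones (1/3 each)."""
    if slippery:
        action = rng.choice(action + PERPENDICULAR[action])
    d_row, d_col = MOVES[action]
    # Walking into the border leaves the agent where it is.
    row = min(max(pos[0] + d_row, 0), len(LAKE) - 1)
    col = min(max(pos[1] + d_col, 0), len(LAKE[0]) - 1)
    return row, col


def run_plan(plan, slippery, rng):
    """Follow a fixed action sequence; return the tile where the episode stops."""
    pos = (0, 0)
    for action in plan:
        pos = step(pos, action, slippery, rng)
        tile = LAKE[pos[0]][pos[1]]
        if tile in "HG":  # hole or gift ends the episode
            return tile
    return LAKE[pos[0]][pos[1]]


def outcome_rates(plan, slippery, episodes=10_000, seed=0):
    """Fraction of episodes ending on the gift, in a hole, or elsewhere."""
    rng = random.Random(seed)
    counts = {"G": 0, "H": 0, "other": 0}
    for _ in range(episodes):
        tile = run_plan(plan, slippery, rng)
        counts[tile if tile in "GH" else "other"] += 1
    return {key: value / episodes for key, value in counts.items()}
```

Comparing `outcome_rates("RRDDDR", slippery=False)` with `outcome_rates("RRDDDR", slippery=True)` shows how quickly the deterministic plan falls apart once moves can slip.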
Example of a learned Q-table for the Frozen Lake game
Final result: with Q-learning we learned a strategy that reaches the goal 74% of the time, with a variance of only 0.44.
The idea of this project was to show that Q-learning is able to learn the 'tricky' slippery strategy.
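For reference, the core of tabular Q-learning fits in a few lines. This is a generic sketch of the textbook update rule, not the exact training code of this project; the learning rate `alpha`, discount `gamma` and the function names are assumed for illustration.

```python
import random


def make_q_table(n_states, n_actions):
    """One row per state, one column per action, initialised to zero."""
    return [[0.0] * n_actions for _ in range(n_states)]


def choose_action(q_table, state, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_table[state]))
    row = q_table[state]
    return row.index(max(row))


def q_update(q_table, state, action, reward, next_state, done,
             alpha=0.1, gamma=0.99):
    """One Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    Terminal transitions (hole or gift) have no future value."""
    best_next = 0.0 if done else max(q_table[next_state])
    target = reward + gamma * best_next
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]
```

With the extra -1 reward for holes, a transition into a hole would be passed as `reward=-1.0, done=True`, which pushes the value of the risky action below that of safer alternatives.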
Taxi is a good example of a sparse-reward problem.
The agent has to figure out how to find the passenger's location, then when to use the 'pickup' action correctly, then where to go with the passenger, and finally when to use the 'drop off' action properly.
Note that illegal 'pickup' and 'drop off' actions result in a negative reward, so the agent might learn never to use them at all.
Description:
There are four special locations, Red (0), Green (1), Yellow (2), and Blue (3), out of a total of 25 squares.
The taxi starts at a random square and the passenger at a random location.
The taxi drives to the passenger, picks him up, drives to the destination, and drops the passenger off. The episode then ends.
Observation:
R (rows) × C (columns) matrix
taxi position [i, j] (25 values)
passenger location (0, 1, 2, 3, 4), where 4 means the passenger is in the taxi
passenger destination (0, 1, 2, 3)
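Since a Q-table needs a single row index per state, the components above can be flattened into one integer. The sketch below shows one possible encoding, assuming a 5 × 5 grid for the 25 taxi positions; the function names are hypothetical.

```python
N_ROWS, N_COLS = 5, 5   # assumed 5 x 5 grid, giving 25 taxi positions
N_PASSENGER = 5         # four named locations plus "in the taxi"
N_DESTINATION = 4
N_STATES = N_ROWS * N_COLS * N_PASSENGER * N_DESTINATION  # 25 * 5 * 4 = 500


def encode(row, col, passenger, destination):
    """Flatten (taxi row, taxi column, passenger location, destination)
    into one index in the range [0, N_STATES)."""
    return ((row * N_COLS + col) * N_PASSENGER + passenger) * N_DESTINATION + destination


def decode(state):
    """Inverse of encode: recover the four components from the index."""
    state, destination = divmod(state, N_DESTINATION)
    state, passenger = divmod(state, N_PASSENGER)
    row, col = divmod(state, N_COLS)
    return row, col, passenger, destination
```

The resulting 500 states are small enough for a plain Q-table with one row per state.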
Reward:
-1 per step unless another reward is triggered.
+20 for delivering the passenger.
-10 for executing “pickup” or “drop-off” actions illegally.