## What is reinforcement learning?

Reinforcement learning is a branch of machine learning concerned with learning to make a sequence of decisions that best achieves a goal within an unknown environment. Unlike supervised learning, reinforcement learning doesn’t involve training a model on a curated dataset to perform a specific, known task. Instead, we train an *agent* through trial and error within the target environment, continuously adjusting the agent’s behavior based on feedback called the *reward* (e.g., did the agent win or not). This is why reinforcement learning is well suited to playing games: it can discover strong strategies without knowing anything about the game to begin with.

## Writing an agent from scratch

One of the most basic forms of reinforcement learning is Q-learning, where the agent learns a “Q-table” that holds, for every possible situation, an estimated value for each action it could take. The agent then acts by picking the action with the highest value.

In any environment, everything is broken down into “states” and “actions.” The states are observations and samplings that we pull from the environment, and the actions are the choices the agent makes based on the current state.
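As a rough illustration of the idea, a Q-table can be pictured as a lookup from each state to one value per action, with the agent picking the highest-valued action. The state names and numbers in this sketch are made up for illustration and don’t come from any real environment.

```python
# Hypothetical toy Q-table: two states, three actions each.
ACTIONS = ["left", "stay", "right"]

q_table = {
    "rolling_right_fast": [-1.8, -1.9, -1.2],
    "stuck_on_right_slope": [-1.1, -1.6, -1.4],
}

def best_action(state):
    """Return the name of the action with the highest Q value for this state."""
    q_values = q_table[state]
    best_index = max(range(len(q_values)), key=lambda i: q_values[i])
    return ACTIONS[best_index]

for state in q_table:
    print(state, "->", best_action(state))
```

Learning, in this picture, is just the process of filling the table with numbers good enough that the highest-valued action is also the right one.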

To get started quickly with an environment, we’ll use OpenAI’s gym, a Python library that provides a variety of predefined environments, like games and simulations. The environment we’ll be using is “MountainCar-v0”, where the objective is to reach the flag by pushing left or right. Our “action space” is size 3: we can push left (0), stay still (1), or push right (2).

```python
import gym

env = gym.make("MountainCar-v0")
print(env.reset())
```

Let’s implement a simple strategy, only accelerating to the right (action=2), to see how we can work with the environment:

```python
import gym

env = gym.make("MountainCar-v0")
state = env.reset()

done = False
while not done:
    action = 2  # always push right
    # Classic Gym API (gym < 0.26): step() returns a 4-tuple.
    new_state, reward, done, _ = env.step(action)
    print(reward, new_state)
```

Each action we take in the environment returns a new state, a reward, and a flag telling us whether the game is finished (either through winning or reaching the limit of 200 steps). With our current strategy, the reward after each step is simply -1, the result of not reaching the flag.

Our agent needs to learn to build up momentum to reach the flag. For this, let’s try and implement Q-learning.

Q-learning works by keeping a “Q” value for every possible action in every possible state, which gives us a table. To figure out all of the possible states, we can either query the environment or, if that isn’t possible, simply interact with it for a while. In our case, gym provides the bounds of the observation space:

```python
print(env.observation_space.high)  # > [0.6 0.07]
print(env.observation_space.low)   # > [-1.2 -0.07]
```

The two values are the car’s position and velocity. Both are continuous, so we need to group them into a manageable number of discrete buckets. Let’s use 20 groups for each range. Our Q-table becomes:

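```python
import numpy as np
DISCRETE_OS_SIZE = [20, 20]
# One Q value per (position bucket, velocity bucket, action): shape (20, 20, 3)
q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n]))
```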
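To make the bucketing concrete, here is a minimal sketch of how a continuous observation could be turned into a pair of Q-table indices. The bounds are the ones printed above. The helper name `to_bucket` and the clamping of out-of-range indices are assumptions for illustration, not part of gym.

```python
# Bounds printed by env.observation_space above: (position, velocity).
LOW = (-1.2, -0.07)
HIGH = (0.6, 0.07)
BUCKETS = (20, 20)  # 20 groups per dimension, as in DISCRETE_OS_SIZE

def to_bucket(state):
    """Map a continuous (position, velocity) pair to integer bucket indices."""
    indices = []
    for value, low, high, n in zip(state, LOW, HIGH, BUCKETS):
        width = (high - low) / n  # size of one bucket in this dimension
        index = int((value - low) / width)
        # Assumption: clamp so a value at the upper bound stays inside the table.
        indices.append(min(max(index, 0), n - 1))
    return tuple(indices)

print(to_bucket((-0.5, 0.01)))
```

Every observation that falls in the same pair of buckets shares one row of three Q values, which is what keeps the table small enough to learn.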

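Now, we need to update this Q-table through trial and error. The update rule is defined as follows:

$$Q_{\text{new}}(s_t, a_t) = (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) \right)$$

Here $\alpha$ is the learning rate, $\gamma$ is the discount, $r_t$ is the reward for the step, and $\max_{a} Q(s_{t+1}, a)$ is the best Q value available in the next state.

Or, in code,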

```python
new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
```

The DISCOUNT measures how much we care about future reward relative to immediate reward. It lies between 0 and 1 and is typically set fairly high, because the purpose of Q-learning is to learn a chain of events that ends with a positive outcome, so it’s only natural that we put greater importance on long-term gains than on short-term ones.
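To put numbers on the update rule and the discount, here is a small sketch that works through a single update by hand. The values passed for `current_q` and `max_future_q` are made up, and the `q_update` helper is only a wrapper around the one-line formula. `LEARNING_RATE` and `DISCOUNT` use the same values as the full script later in this post.

```python
LEARNING_RATE = 0.1
DISCOUNT = 0.95

def q_update(current_q, reward, max_future_q):
    """Blend the old estimate with the newly observed target."""
    target = reward + DISCOUNT * max_future_q
    return (1 - LEARNING_RATE) * current_q + LEARNING_RATE * target

# Hypothetical numbers for one step: old estimate -1.5, best next-state value -0.5.
new_q = q_update(current_q=-1.5, reward=-1, max_future_q=-0.5)
print(round(new_q, 4))

# Weight applied to a reward that lies k steps in the future.
weights = {k: round(DISCOUNT ** k, 3) for k in (1, 10, 50)}
print(weights)
```

With a learning rate of 0.1 the estimate only moves a tenth of the way toward the new target, so a single lucky or unlucky step can’t overwrite what was learned before. With a discount of 0.95, a reward 10 steps away still counts for roughly 60% of its face value.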

The max_future_q is grabbed after we’ve already performed our action, and then we update our previous value based partially on the next step’s best Q value. Over time, once we’ve reached the objective once, this “reward” value gets slowly back-propagated, one step at a time, per episode. Super basic concept, but pretty neat how it works!
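The one-step-per-episode propagation is easier to see on something much smaller than MountainCar. The sketch below is a toy under stated assumptions: a hypothetical five-state corridor, an agent that always walks right, a reward of +1 at the goal and 0 elsewhere, and a Q-table that starts at zero. None of these choices come from the MountainCar setup; they just make the effect visible.

```python
LEARNING_RATE = 0.5
DISCOUNT = 0.9
N_STATES = 5  # states 0..4; state 4 is the goal
RIGHT = 1     # actions: 0 = left, 1 = right

# One row per state, one Q value per action, all starting at zero.
q_table = [[0.0, 0.0] for _ in range(N_STATES)]

def run_episode():
    """Walk right from state 0 to the goal, updating Q after every step."""
    state = 0
    while state != N_STATES - 1:
        new_state = state + 1
        done = new_state == N_STATES - 1
        reward = 1.0 if done else 0.0
        # The terminal state has no future value.
        max_future_q = 0.0 if done else max(q_table[new_state])
        current_q = q_table[state][RIGHT]
        q_table[state][RIGHT] = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (
            reward + DISCOUNT * max_future_q
        )
        state = new_state

for episode in range(1, 5):
    run_episode()
    learned = [s for s in range(N_STATES) if q_table[s][RIGHT] > 0]
    print(f"episode {episode}: states with a learned value -> {learned}")
```

After the first episode only the state next to the goal has a non-zero value. Each further episode pulls that information back by exactly one more state, which is the same mechanism that eventually teaches the car which moves lead toward the flag.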

Now, we just need to write out the surrounding logic that steps through the environment:

```python
import gym
import numpy as np

# Written against the classic Gym API (gym < 0.26), where reset() returns the
# observation and step() returns a 4-tuple.
env = gym.make("MountainCar-v0")

LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 25000
SHOW_EVERY = 3000

DISCRETE_OS_SIZE = [20, 20]
discrete_os_win_size = (env.observation_space.high - env.observation_space.low) / DISCRETE_OS_SIZE

# Exploration settings
epsilon = 1  # not a constant, going to be decayed
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2
epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n]))


def get_discrete_state(state):
    discrete_state = (state - env.observation_space.low) / discrete_os_win_size
    # We use this tuple to look up the 3 Q values for the available actions in the Q-table.
    # np.int was removed from recent NumPy releases, so use the built-in int.
    return tuple(discrete_state.astype(int))


for episode in range(EPISODES):
    discrete_state = get_discrete_state(env.reset())
    done = False

    if episode % SHOW_EVERY == 0:
        render = True
        print(episode)
    else:
        render = False

    while not done:
        if np.random.random() > epsilon:
            # Get action from Q table
            action = np.argmax(q_table[discrete_state])
        else:
            # Get random action
            action = np.random.randint(0, env.action_space.n)

        new_state, reward, done, _ = env.step(action)
        new_discrete_state = get_discrete_state(new_state)

        if render:
            env.render()

        # If simulation did not end yet after last step - update Q table
        if not done:
            # Maximum possible Q value in next step (for new state)
            max_future_q = np.max(q_table[new_discrete_state])

            # Current Q value (for current state and performed action)
            current_q = q_table[discrete_state + (action,)]

            # And here's our equation for a new Q value for current state and action
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)

            # Update Q table with new Q value
            q_table[discrete_state + (action,)] = new_q

        # Simulation ended (for any reason) - if goal position is achieved - update Q value with reward directly
        elif new_state[0] >= env.goal_position:
            # q_table[discrete_state + (action,)] = reward
            q_table[discrete_state + (action,)] = 0

        discrete_state = new_discrete_state

    # Decaying is being done every episode if episode number is within decaying range
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value

env.close()
```
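The script relies on an exploration strategy that the text doesn’t spell out: with probability `epsilon` the agent takes a random action, otherwise it takes the best action from the Q-table, and `epsilon` shrinks over the first half of training. The sketch below reproduces just that schedule in plain Python so it can be inspected without gym. The helper names `epsilon_schedule` and `choose_action` are hypothetical, and the constants are copied from the script.

```python
import random

EPISODES = 25000
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2

def epsilon_schedule():
    """Return the epsilon value in effect at the start of every episode."""
    epsilon = 1.0
    decay = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)
    values = []
    for episode in range(EPISODES):
        values.append(epsilon)
        # Same decay condition as the training loop.
        if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
            epsilon -= decay
    return values

def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() > epsilon:
        return max(range(len(q_values)), key=lambda a: q_values[a])
    return rng.randrange(len(q_values))

schedule = epsilon_schedule()
for episode in (0, 3000, 6000, 12000, 24000):
    print(episode, round(schedule[episode], 3))
```

With these constants, epsilon starts at 1 (fully random) and has dropped to roughly zero by the halfway point, after which the agent acts almost entirely on what the Q-table has learned.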