What is reinforcement learning?

Reinforcement learning is a branch of machine learning concerned with learning to make a sequence of decisions that best achieves a goal within an unknown environment. Unlike supervised learning, reinforcement learning doesn't involve training a model on a curated dataset to perform a specific, known task. Instead, we train an agent through trial and error within the target environment, continuously adjusting the agent's actions based on feedback called the reward (e.g., did the agent win or not). This is why reinforcement learning is well suited to playing games: it can find optimal strategies without knowing anything about the game to begin with.

Writing an agent from scratch

The most basic form of reinforcement learning is Q-learning, where the agent learns a "Q-table": a value for every possible state-action pair, telling it which action is best in each situation.

In any environment, everything is broken down into "states" and "actions." The states are the observations we pull from the environment, and the actions are the choices the agent makes based on the current state.

To get started quickly with an environment, we'll use OpenAI's gym, a Python library that provides a variety of pre-defined environments, like games and simulations. The environment we'll be using is "MountainCar-v0", where the objective is to reach the flag by pushing a car left or right. Our "action space" is size 3: we can push left, push right, or stay still.

```python
import gym

env = gym.make("MountainCar-v0")
print(env.reset())  # the initial state: [position, velocity]
```

Let’s implement a simple strategy, only accelerating to the right (action=2), to see how we can work with the environment:

```python
import gym

env = gym.make("MountainCar-v0")
state = env.reset()

done = False
while not done:
    action = 2  # always push right
    new_state, reward, done, _ = env.step(action)
    print(reward, new_state)
```

Each action we take in the environment returns a new state, a reward, and whether or not the episode is finished (either by winning or by reaching the limit of 200 steps). With our current strategy, the reward after each step is simply -1, the result of not reaching the flag.

Our agent needs to learn to build up momentum to reach the flag. For this, let's implement Q-learning.

The way Q-learning works is that there's a Q value for every possible action in every possible state, which forms a table. To figure out the range of possible states, we can either query the environment or simply interact with it for a while. In our case, gym provides the observation space directly:

```python
print(env.observation_space.high)  # > [0.6  0.07]
print(env.observation_space.low)   # > [-1.2 -0.07]
```

This observation space is continuous, so we need to discretize it; let's use 20 buckets for each of the two dimensions. Our Q-table becomes:

```python
import numpy as np

DISCRETE_OS_SIZE = [20, 20]
q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n]))
```
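To look up a state in this table, we first have to convert a continuous observation into a pair of bucket indices. A minimal sketch of that conversion, with the observation bounds hardcoded from the values we printed above:

```python
import numpy as np

# Bounds of the MountainCar observation space, as printed above
OS_HIGH = np.array([0.6, 0.07])
OS_LOW = np.array([-1.2, -0.07])

DISCRETE_OS_SIZE = [20, 20]
discrete_os_win_size = (OS_HIGH - OS_LOW) / DISCRETE_OS_SIZE  # bucket width per dimension

def get_discrete_state(state):
    # Scale the continuous (position, velocity) pair into bucket indices 0..19
    discrete_state = (state - OS_LOW) / discrete_os_win_size
    return tuple(discrete_state.astype(int))

# The car starts near position -0.5 with zero velocity
print(get_discrete_state(np.array([-0.5, 0.0])))  # (position bucket, velocity bucket)
```

The resulting tuple indexes straight into the first two dimensions of q_table, leaving a row of 3 Q values, one per action.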

Now, we need to update this Q-table through trial and error. The update rule for a Q value is defined as follows:

Q_new(s, a) = (1 - α) · Q(s, a) + α · (r + γ · max_a' Q(s', a'))

where α is the learning rate, γ is the discount factor, r is the reward, and s' is the state we land in after taking action a. Or, in code,

```python
new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
```

The DISCOUNT is a measure of how much we care about future reward relative to immediate reward. Typically this value is fairly high, between 0 and 1. We want it high because the purpose of Q-learning is to learn a chain of events that ends with a positive outcome, so it's only natural that we put greater importance on long-term gains than short-term ones.
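As a quick sanity check on what a DISCOUNT of 0.95 means in practice: a reward received k steps in the future is worth DISCOUNT**k today, so a reward 20 steps away still retains about 36% of its value, while one 100 steps away is almost worthless:

```python
DISCOUNT = 0.95

# Present value of a reward of 1.0 received k steps in the future
for k in [1, 5, 20, 100]:
    print(k, DISCOUNT ** k)
```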

The max_future_q is grabbed after we've performed our action, and we update our previous value based partially on the best Q value of the next step. Over time, once we've reached the objective, this reward value slowly propagates backward, one step at a time, per episode. Super basic concept, but pretty neat how it works!
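To see this backward propagation in action, here's a toy example (not MountainCar): a five-state corridor with a single "move right" action, where only the final transition gives a reward. With a learning rate of 1 to make the effect obvious, the value propagates back exactly one state per episode:

```python
import numpy as np

LEARNING_RATE = 1.0  # full overwrite on every update, to make propagation obvious
DISCOUNT = 0.9
N_STATES = 5         # states 0..4; state 4 is terminal

# One Q value per state, since there is only one action ("move right");
# q[4] stays 0 because the terminal state has no future reward
q = np.zeros(N_STATES)

for episode in range(4):
    for s in range(N_STATES - 1):
        reward = 1.0 if s == N_STATES - 2 else 0.0  # reward only on the last transition
        max_future_q = q[s + 1]
        q[s] = (1 - LEARNING_RATE) * q[s] + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
    print(episode, q.round(3))  # the nonzero values creep back one state per episode
```

After episode 0, only the state next to the goal has a nonzero Q value; each further episode pulls the (discounted) value one state closer to the start.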

Now, we just need to write out the surrounding logic that steps through the environment:

```python
import gym
import numpy as np

env = gym.make("MountainCar-v0")

LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 25000
SHOW_EVERY = 3000

DISCRETE_OS_SIZE = [20, 20]
discrete_os_win_size = (env.observation_space.high - env.observation_space.low) / DISCRETE_OS_SIZE

# Exploration settings
epsilon = 1  # not a constant, going to be decayed
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2
epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n]))


def get_discrete_state(state):
    discrete_state = (state - env.observation_space.low) / discrete_os_win_size
    # we use this tuple to look up the 3 Q values for the available actions in the q-table
    return tuple(discrete_state.astype(int))


for episode in range(EPISODES):
    discrete_state = get_discrete_state(env.reset())
    done = False

    if episode % SHOW_EVERY == 0:
        render = True
        print(episode)
    else:
        render = False

    while not done:
        if np.random.random() > epsilon:
            # Get action from Q table
            action = np.argmax(q_table[discrete_state])
        else:
            # Get random action
            action = np.random.randint(0, env.action_space.n)

        new_state, reward, done, _ = env.step(action)
        new_discrete_state = get_discrete_state(new_state)

        if render:
            env.render()

        # If simulation did not end yet after last step - update Q table
        if not done:
            # Maximum possible Q value in next step (for new state)
            max_future_q = np.max(q_table[new_discrete_state])

            # Current Q value (for current state and performed action)
            current_q = q_table[discrete_state + (action,)]

            # And here's our equation for a new Q value for current state and action
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)

            # Update Q table with new Q value
            q_table[discrete_state + (action,)] = new_q
        # Simulation ended (for any reason) - if goal position is achieved,
        # update Q value directly with the reward for reaching the flag (0)
        elif new_state[0] >= env.goal_position:
            q_table[discrete_state + (action,)] = 0

        discrete_state = new_discrete_state

    # Decaying is being done every episode if episode number is within decaying range
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value

env.close()
```
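One piece of the script worth calling out is the exploration/exploitation trade-off: with probability epsilon the agent takes a random action instead of the argmax of the Q-table, and epsilon decays linearly from 1 toward 0 over the first half of training. A self-contained sketch of just that decay schedule:

```python
EPISODES = 25000

# Exploration settings, same as in the script above
epsilon = 1  # probability of taking a random action
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2
epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

history = []
for episode in range(EPISODES):
    history.append(epsilon)
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value

# epsilon falls linearly from 1 to (roughly) 0 over the first half of training,
# then stays there: early episodes explore, later episodes exploit the Q-table
print(history[0], history[EPISODES // 4], history[EPISODES // 2], history[-1])
```

Without this schedule the agent would greedily exploit its randomly initialised Q-table from the start and could easily get stuck never discovering the flag at all.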