Hands-on Tutorial on Reinforcement Learning With Python

Published on October 30, 2023

Reinforcement Learning
OpenAI Gym
Weights & Biases

Delve into the core concepts of Reinforcement Learning through a hands-on tutorial on Q-learning with Python, OpenAI's Gym, and Weights & Biases. Explore, learn, and enhance your understanding of RL in a grid world scenario.

A grid world representation showcasing the agent's path to the goal


Reinforcement Learning (RL) is a type of machine learning in which an agent learns to make decisions by interacting with an environment, aiming to maximize cumulative reward. The agent learns a policy: a strategy for choosing the next action based on the current state. Among RL algorithms, Q-learning is popular for its simplicity and effectiveness. In this tutorial, we will work through Q-learning in a grid world scenario, using Python, OpenAI’s Gym, and Weights & Biases (wandb) for logging and visualization. The goal is a concrete understanding of the core concepts of RL and Q-learning, and a practical foundation for going deeper into the field.

Understanding the Basics

In the realm of Reinforcement Learning (RL), the interaction between an agent and its environment is crucial. The agent observes the current state of the environment, decides on an action based on this observation, executes the action, and receives feedback in the form of a reward or penalty. This feedback helps the agent evaluate the effectiveness of its action, guiding its future decisions.
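
The observe, act, and receive-feedback cycle described above can be sketched in plain Python. This is only an illustration under stated assumptions: the `CorridorEnv` class, its five-cell layout, and the `run_episode` and `always_right` helpers are hypothetical names introduced here, not part of Gym or of this tutorial's grid world.

```python
class CorridorEnv:
    """Hypothetical 1-D corridor: cells 0..length-1, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action 0 moves left, action 1 moves right; walls clamp the move
        move = 1 if action == 1 else -1
        self.position = min(self.length - 1, max(0, self.position + move))
        done = self.position == self.length - 1
        reward = 1 if done else -1
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return (total_reward, steps_taken)."""
    state = env.reset()                          # observe the initial state
    total_reward, steps = 0, 0
    for _ in range(max_steps):
        action = policy(state)                   # decide on an action
        state, reward, done = env.step(action)   # act, then receive feedback
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


def always_right(state):
    """A trivial fixed policy: move right regardless of the state."""
    return 1


print(run_episode(CorridorEnv(), always_right))  # (-2, 4)
```

Three steps cost -1 each and the fourth reaches the goal for +1, so the episode returns a total reward of -2 after 4 steps.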

The essence of RL lies in learning a strategy or policy, which is a mapping from states to actions that maximizes the cumulative reward over time. The policy can be deterministic, where a specific action is chosen for each state, or stochastic, where a probability distribution over actions is defined for each state.
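
The difference between the two kinds of policy can be made concrete with a small sketch. The state numbers, action names, and probabilities below are made up for illustration, and `sample_action` is a hypothetical helper.

```python
import random

ACTIONS = ["up", "down", "left", "right"]

# Deterministic policy: exactly one action per state.
deterministic_policy = {0: "right", 1: "right", 2: "down"}

# Stochastic policy: a probability distribution over actions per state.
stochastic_policy = {
    0: {"up": 0.1, "down": 0.2, "left": 0.1, "right": 0.6},
}


def sample_action(policy, state, rng):
    """Draw one action from the distribution the policy assigns to `state`."""
    distribution = policy[state]
    actions = list(distribution)
    weights = [distribution[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded only so the sketch is repeatable
print(deterministic_policy[0])                    # always "right"
print(sample_action(stochastic_policy, 0, rng))   # depends on the random draw
```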

Q-learning, a widely used RL method, learns the value of each action in each state, denoted as Q-values, so the agent can pick the best action. The Q-value of a state-action pair is the expected cumulative (discounted) reward the agent obtains by starting in that state, taking that action, and following a given policy thereafter. Q-learning estimates these values for the optimal policy.
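
For reference, the standard tabular Q-learning update can be written as below. The symbols are notation assumed here rather than taken from the tutorial: \alpha is the learning rate, \gamma the discount factor, r the reward received after taking action a in state s, and s' the resulting next state.

```latex
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right]
```

The bracketed term is the new estimate of the return; the learning rate controls how far the old value moves toward it.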

In this tutorial, we’ll explore Q-learning by developing a grid world environment using OpenAI’s Gym, a toolkit for comparing reinforcement learning algorithms. Our journey will also involve leveraging Weights & Biases (wandb) to log and visualize the learning process, offering a clear perspective on how the agent improves over time. This venture is structured for individuals with a fundamental understanding of Python, ready to navigate the intricacies of RL and Q-learning.

Prerequisites

Before starting this tutorial, make sure you have the following:

  1. Basic Python: A working grasp of Python programming is essential, as we’ll use it throughout this tutorial.
  2. Elementary Knowledge of Reinforcement Learning: Familiarity with core concepts of RL such as states, actions, and rewards will be beneficial. If you’re new to RL, consider going through a basic tutorial first.
  3. A Weights & Biases Account: We’ll use Weights & Biases (wandb) to log and visualize training metrics. Setting up an account beforehand will streamline the process.

Having these prerequisites in place will equip you to get the most out of this hands-on exploration into Q-learning through a grid world scenario.

Setting Up Your Work Environment

To keep the project organized and reproducible, set up a virtual environment to manage dependencies, then install the libraries this tutorial needs:

  1. Create a Virtual Environment: Navigate to your project directory and run the following command to create a new virtual environment:

python3 -m venv rl-venv

  2. Activate the Virtual Environment: Before installing the libraries, activate the virtual environment.

On Windows, use:

rl-venv\Scripts\activate

On macOS and Linux, use:

source rl-venv/bin/activate

  3. Install Necessary Libraries: With the virtual environment activated, install the required libraries using pip. In this tutorial, we’ll need gym for creating our grid world environment, wandb for logging and visualization, and numpy for numerical operations:

pip install gym wandb numpy

Now, with the virtual environment set up and the necessary libraries installed, you’re well-prepared to dive into the realm of Reinforcement Learning.
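
As a quick sanity check, you could confirm that the three packages are visible from the active environment. The snippet below is a minimal sketch using only the standard library; `check_packages` is a hypothetical helper name, and the check only tests whether each package can be found, not which version is installed.

```python
from importlib.util import find_spec


def check_packages(names=("gym", "wandb", "numpy")):
    """Return a dict mapping each package name to whether it can be found."""
    return {name: find_spec(name) is not None for name in names}


for name, available in check_packages().items():
    print(f"{name}: {'ok' if available else 'MISSING'}")
```

If any package is reported as missing, make sure the virtual environment is activated and rerun the pip command.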

Project Walkthrough

In this walkthrough, we’ll build a simplified grid world using OpenAI’s Gym, a toolkit for developing and comparing reinforcement learning algorithms. Our agent will move through the grid, aiming to reach a designated goal cell. To keep things simple, this version of the grid contains no obstacles: every step carries a small penalty until the goal is reached. Although we use a basic grid world in this tutorial (logged to a wandb project named rl_gridworld), feel free to experiment with other environments like CartPole-v1 or MountainCar-v0 in future projects. These environments pose different challenges that can be quite enlightening.

WandB Initialization

Let’s initiate a run with Weights & Biases (wandb) to log and visualize our project’s metrics. WandB is a platform for data scientists and machine learning practitioners to visualize and compare machine learning experiments.


import wandb

# Initiating a wandb run
wandb.init(project='rl_gridworld', name='q_learning')

# Setting up the configuration parameters
config = wandb.config
config.learning_rate = 0.1
config.discount_factor = 0.95
config.exploration_rate = 1.0

Crafting the Grid World

Our next stride is towards crafting a custom environment for our grid world using OpenAI Gym.


import gym
from gym import spaces

class GridWorld(gym.Env):
    def __init__(self, grid_size=5):
        super(GridWorld, self).__init__()

        self.grid_size = grid_size
        self.current_position = (0, 0)  # Starting position
        self.goal_position = (grid_size-1, grid_size-1)  # Goal position
        self.action_space = spaces.Discrete(4)  # Up, Down, Left, Right
        self.observation_space = spaces.Discrete(grid_size * grid_size)  # Grid cells

    def step(self, action):
        x, y = self.current_position
        if action == 0:  # Up
            x = max(0, x-1)
        elif action == 1:  # Down
            x = min(self.grid_size-1, x+1)
        elif action == 2:  # Left
            y = max(0, y-1)
        elif action == 3:  # Right
            y = min(self.grid_size-1, y+1)
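        self.current_position = (x, y)
        reward = 1 if self.current_position == self.goal_position else -1
        done = self.current_position == self.goal_position
        # Return the state as a single integer so it matches observation_space
        # (Discrete) and can index the Q-table directly
        return self._state_index(), reward, done, {}

    def reset(self):
        self.current_position = (0, 0)
        return self._state_index()

    def _state_index(self):
        # Flatten (row, col) into row * grid_size + col
        x, y = self.current_position
        return x * self.grid_size + y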
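
The environment declares its observation space as Discrete(grid_size * grid_size), that is, one integer per grid cell. One common way to convert between a (row, column) position and such an integer is row-major indexing, sketched below. The helper names `to_index` and `to_cell` are hypothetical and not part of Gym.

```python
def to_index(row, col, grid_size=5):
    """Flatten a (row, col) cell into a single integer in [0, grid_size**2)."""
    return row * grid_size + col


def to_cell(index, grid_size=5):
    """Recover the (row, col) cell from a flattened index."""
    return divmod(index, grid_size)


print(to_index(0, 0))   # 0  (start cell)
print(to_index(4, 4))   # 24 (goal cell in a 5x5 grid)
print(to_cell(7))       # (1, 2)
```

An integer state of this form can be used directly as a row index into a Q-table with one row per cell.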

Breathing Life into the Q-learning Agent

With the stage set, it’s time to introduce our Q-learning agent into this world. This agent will be equipped with methods to choose actions, update Q-values, and interact with the environment.


import numpy as np

class QLearningAgent:
    def __init__(self, env, learning_rate, discount_factor, exploration_rate):
        self.env = env
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.exploration_rate = exploration_rate
        # Initialize Q-table with zeros
        self.q_table = np.zeros((env.observation_space.n, env.action_space.n))

    def choose_action(self, state):
        # Epsilon-greedy policy: explore with probability exploration_rate,
        # otherwise exploit the best known action
        if np.random.uniform(0, 1) < self.exploration_rate:
            return self.env.action_space.sample()  # Explore
        return int(np.argmax(self.q_table[state, :]))  # Exploit

    def update_q_table(self, state, action, reward, next_state):
        # Q(s, a) <- (1 - lr) * Q(s, a) + lr * (reward + discount * max_a' Q(s', a'))
        best_next_action = np.argmax(self.q_table[next_state, :])
        updated_value = (1 - self.learning_rate) * self.q_table[state, action] + \
                        self.learning_rate * (reward + self.discount_factor * self.q_table[next_state, best_next_action])
        self.q_table[state, action] = updated_value

    def train(self, episodes):
        # Training the agent through episodes
        for episode in range(episodes):
            state = self.env.reset()
            done = False
            episode_reward = 0  # Track the cumulative reward each episode
            while not done:
                action = self.choose_action(state)
                next_state, reward, done, _ = self.env.step(action)
                self.update_q_table(state, action, reward, next_state)
                state = next_state
                episode_reward += reward  # Accumulate rewards for the episode
                wandb.log({"Reward": reward, "Exploration Rate": self.exploration_rate})

            # Logging the total episode reward and reducing exploration rate
            wandb.log({"Episode Reward": episode_reward})
            self.exploration_rate *= 0.995  # Exponential decay of exploration rate

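# Instantiate and train the agent
agent = QLearningAgent(env=GridWorld(), learning_rate=config.learning_rate,
                       discount_factor=config.discount_factor, exploration_rate=config.exploration_rate)
agent.train(episodes=500)  # Example episode count (an assumption); adjust as needed
wandb.finish()  # Mark the wandb run as complete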

In this section, we set up our environment, initialized wandb for logging, and implemented a Q-learning agent to interact with the grid world. In each episode, the agent learns from its actions and refines its strategy for reaching the goal. The wandb platform serves as a window into this learning process, providing insights through visualizations and logs.
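
To see what a single Q-value update does numerically, here is a small worked sketch using the learning rate (0.1) and discount factor (0.95) configured earlier. The standalone `q_update` function is a hypothetical helper that mirrors the arithmetic in `update_q_table`; it is not part of the agent class.

```python
def q_update(old_value, reward, best_next_value,
             learning_rate=0.1, discount_factor=0.95):
    """Blend the old Q-value with the new target estimate."""
    target = reward + discount_factor * best_next_value
    return (1 - learning_rate) * old_value + learning_rate * target


q = 0.0                                            # Q-table starts at zero
q = q_update(q, reward=-1, best_next_value=0.0)
print(round(q, 4))                                 # -0.1
q = q_update(q, reward=-1, best_next_value=0.0)
print(round(q, 4))                                 # -0.19

# A step that lands on the goal (+1) from a fresh Q-value:
print(round(q_update(0.0, reward=1, best_next_value=0.0), 4))  # 0.1
```

Each update moves the stored value only a tenth of the way toward the target, which is why many episodes are needed before the values settle.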

Visualizing Training

With your agent trained, it’s time to analyze its performance using Weights & Biases (wandb). Navigate to your wandb dashboard and find the ‘rl_gridworld’ project. Inside, you’ll find a new run entry showcasing various metrics logged during training. This dashboard provides insights into how the agent’s learning evolved over the episodes, displaying metrics such as cumulative rewards and exploration rate decay. These visualizations are instrumental in understanding the agent’s behavior and the effectiveness of the training setup, assisting in potential refinements for future RL projects.
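
When reading the exploration rate chart, it can help to know what curve to expect. The sketch below assumes the schedule used in the training loop (a starting rate of 1.0 multiplied by 0.995 after every episode). The function names are hypothetical.

```python
import math


def exploration_rate_after(episodes, start=1.0, decay=0.995):
    """Exploration rate once `episodes` episodes have completed."""
    return start * decay ** episodes


def episodes_until(target, start=1.0, decay=0.995):
    """Smallest number of episodes after which the rate drops below `target`."""
    return math.ceil(math.log(target / start) / math.log(decay))


print(round(exploration_rate_after(100), 3))  # 0.606
print(episodes_until(0.1))                    # 460
```

Under this schedule the agent still explores about 60% of the time after 100 episodes, and needs roughly 460 episodes before exploration falls below 10%.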

Wrapping Up

Through this tutorial, you’ve gained a hands-on understanding of Q-learning by developing a grid world scenario. You’ve seen how an agent interacts with its environment and how to visualize and analyze its performance using Weights & Biases. This project serves as a stepping stone to further explore and scale your Reinforcement Learning endeavors. With wandb at your fingertips, you’re well-equipped to monitor and refine your RL agents across a variety of challenges, propelling your projects to new heights.

Further Explorations

With a foundational understanding of Q-learning under your belt, the landscape of reinforcement learning (RL) unfolds with a multitude of directions for exploration and deeper understanding. Here are some tailored suggestions:

  1. Advanced Environments: Engage with the CartPole-v1 environment in OpenAI’s Gym. Utilize libraries such as Stable Baselines to implement and compare various RL algorithms in achieving superior pole balancing performance.
  2. Real-World Applications: Craft an RL-based trading bot for the stock market using libraries like TensorTrade or Gym-Trading. Experiment with diverse state representations and reward structures to navigate market dynamics.
  3. Custom Environments: Develop a custom traffic intersection simulation using SUMO and apply RL to optimize traffic flow. Explore different traffic scenarios and evaluate the impact of autonomous vehicles on congestion.
  4. Multi-Agent Systems: Design a competitive environment for a game of tic-tac-toe, using a multi-agent library such as PettingZoo for agent interactions. Investigate how training dynamics influence the learning and strategies of the competing agents.
  5. Deep Reinforcement Learning (DRL): Transition into DRL by tackling the LunarLander-v2 environment in OpenAI’s Gym with a Deep Q-Network (DQN) using Keras-RL. Delve into how deep learning can capture complex state representations to enhance agent performance.
  6. Ethical AI in RL: Venture into ethical AI by simulating a healthcare setting using RL, where agent decisions have consequential impacts. Utilize frameworks like AI Ethics Toolkit to guide the ethical design and evaluation of your project.
  7. Community Engagement: Engage in a Kaggle competition focused on RL or contribute to open-source RL projects on GitHub. Delve into community-driven challenges to apply your skills on real-world problems, and interact with other enthusiasts to gain diverse perspectives.

Dive into these projects to further your RL knowledge. And if you create something cool, I’d love to see it. Please feel free to reach out.