How Exploration Agents like Q-Learning, UCB, and MCTS Collaboratively Learn Intelligent Problem-Solving Strategies in Dynamic Grid Environments


In this tutorial, we explore how exploration strategies shape intelligent decision-making through agent-based problem solving. We build and train three agents: a Q-Learning agent with epsilon-greedy exploration, an Upper Confidence Bound (UCB) agent, and a Monte Carlo Tree Search (MCTS) agent, each learning to navigate a grid world and reach a goal efficiently while avoiding obstacles. We also experiment with different ways of balancing exploration and exploitation, visualize learning curves, and compare how each agent adapts and performs under uncertainty. Check out the FULL CODES here.

import numpy as np
import random
from collections import defaultdict, deque
import math
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict


class GridWorld:
    # Actions: 0 = up, 1 = down, 2 = left, 3 = right.
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    def __init__(self, size=10, n_obstacles=15):
        self.size = size
        self.grid = np.zeros((size, size))
        self.start = (0, 0)
        self.goal = (size - 1, size - 1)
        obstacles = set()
        while len(obstacles) < n_obstacles:
            pos = (random.randint(0, size - 1), random.randint(0, size - 1))
            if pos not in (self.start, self.goal):
                obstacles.add(pos)
        for pos in obstacles:
            self.grid[pos] = 1
        self.agent_pos = self.start

    def reset(self):
        self.agent_pos = self.start
        return self.agent_pos

    def get_valid_actions(self, state):
        # Only moves that stay on the grid and avoid obstacle cells are valid.
        valid = []
        for action, (dr, dc) in enumerate(self.MOVES):
            r, c = state[0] + dr, state[1] + dc
            if 0 <= r < self.size and 0 <= c < self.size and self.grid[r, c] == 0:
                valid.append(action)
        return valid

    def step(self, action):
        dr, dc = self.MOVES[action]
        r, c = self.agent_pos[0] + dr, self.agent_pos[1] + dc
        if 0 <= r < self.size and 0 <= c < self.size and self.grid[r, c] == 0:
            self.agent_pos = (r, c)
        done = self.agent_pos == self.goal
        # Small step penalty rewards short paths; large bonus for reaching the goal.
        reward = 10.0 if done else -0.1
        return self.agent_pos, reward, done

We begin by creating a grid world environment that challenges our agent to reach a goal while avoiding obstacles. We design its structure, define movement rules, and ensure realistic navigation boundaries to simulate an interactive problem-solving space. This forms the foundation where our exploration agents will operate and learn. Check out the FULL CODES here.
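Before moving on, a quick smoke test helps confirm the environment behaves as expected. The sketch below (assuming the GridWorld class above, with an illustrative 6×6 grid) takes a few random valid steps and prints the outcomes:

# Quick smoke test for the environment (illustrative parameters).
env = GridWorld(size=6, n_obstacles=5)
state = env.reset()
for _ in range(5):
    valid = env.get_valid_actions(state)
    if not valid:
        break
    state, reward, done = env.step(random.choice(valid))
    print(f"pos={state}, reward={reward:.1f}, done={done}")
    if done:
        break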

class QLearningAgent:
    def __init__(self, n_actions=4, alpha=0.1, gamma=0.95, epsilon=1.0,
                 epsilon_decay=0.995, epsilon_min=0.01):
        self.n_actions = n_actions
        self.alpha = alpha        # learning rate
        self.gamma = gamma        # discount factor
        self.epsilon = epsilon    # current exploration rate
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.q_table = defaultdict(lambda: np.zeros(n_actions))

    def get_action(self, state, valid_actions):
        # Epsilon-greedy: explore a random valid action with probability
        # epsilon, otherwise exploit the current best estimate.
        if random.random() < self.epsilon:
            return random.choice(valid_actions)
        q_values = self.q_table[state]
        return max(valid_actions, key=lambda a: q_values[a])

    def update(self, state, action, reward, next_state, valid_next_actions):
        max_next_q = max((self.q_table[next_state][a] for a in valid_next_actions),
                         default=0.0)
        target = reward + self.gamma * max_next_q
        self.q_table[state][action] += self.alpha * (target - self.q_table[state][action])

    def decay_epsilon(self):
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

We implement the Q-Learning agent that learns through experience, guided by an epsilon-greedy policy. We observe how it explores random actions early on and gradually focuses on the most rewarding paths. Through iterative updates, it learns to balance exploration and exploitation effectively.
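To make the update rule concrete, the agent applies the temporal-difference step Q(s, a) ← Q(s, a) + α[r + γ·max_a' Q(s', a') − Q(s, a)]. Here is a minimal worked example with made-up numbers, not values from an actual run:

# Hand-computed TD update with illustrative values.
alpha, gamma = 0.1, 0.95
q_sa = 0.5           # current estimate Q(s, a)
reward = -0.1        # step penalty received
max_next_q = 2.0     # best available estimate from the next state
target = reward + gamma * max_next_q   # -0.1 + 0.95 * 2.0 = 1.8
q_sa += alpha * (target - q_sa)        # 0.5 + 0.1 * (1.8 - 0.5) = 0.63
print(q_sa)  # 0.63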

class UCBAgent:
   def __init__(self, n_actions=4, c=2.0, gamma=0.95):
       self.n_actions = n_actions
       self.c = c
       self.gamma = gamma
       self.q_values = defaultdict(lambda: np.zeros(n_actions))
       self.action_counts = defaultdict(lambda: np.zeros(n_actions))
       self.total_counts = defaultdict(int)
   def get_action(self, state, valid_actions):
       self.total_counts[state] += 1
       ucb_values = []
       for action in valid_actions:
           q = self.q_values[state][action]
           count = self.action_counts[state][action]
           if count == 0:
               return action
           exploration_bonus = self.c * math.sqrt(math.log(self.total_counts[state]) / count)
           ucb_values.append((action, q + exploration_bonus))
       return max(ucb_values, key=lambda x: x[1])[0]
   def update(self, state, action, reward, next_state, valid_next_actions):
       self.action_counts[state][action] += 1
       count = self.action_counts[state][action]
       current_q = self.q_values[state][action]
       if valid_next_actions:
           max_next_q = max([self.q_values[next_state][a] for a in valid_next_actions])
       else:
           max_next_q = 0
       target = reward + self.gamma * max_next_q
       self.q_values[state][action] += (target - current_q) / count

We develop the UCB agent that uses confidence bounds to guide its exploration decisions. We watch how it strategically tries less-visited actions while prioritizing those that yield higher rewards. This approach helps us understand a more mathematically grounded exploration strategy. Check out the FULL CODES here.
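To see why the bonus term encourages exploration, the following sketch (with illustrative counts, not numbers from a run) evaluates the UCB bonus c·sqrt(ln N / n_a) as an action's visit count grows:

# The UCB exploration bonus shrinks as an action accumulates visits.
import math

c, total_visits = 2.0, 100
for count in (1, 5, 25, 100):
    bonus = c * math.sqrt(math.log(total_visits) / count)
    print(f"visits={count:3d} -> bonus={bonus:.3f}")
# Rarely tried actions receive a large bonus, so they keep being sampled
# until their value estimates become trustworthy.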

class MCTSNode:
   def __init__(self, state, parent=None):
       self.state = state
       self.parent = parent
       self.children = {}
       self.visits = 0
       self.value = 0.0
   def is_fully_expanded(self, valid_actions):
       return len(self.children) == len(valid_actions)
   def best_child(self, c=1.4):
       choices = [(action, child.value / child.visits +
                   c * math.sqrt(2 * math.log(self.visits) / child.visits))
                  for action, child in self.children.items()]
       return max(choices, key=lambda x: x[1])


class MCTSAgent:
    def __init__(self, env, n_simulations=50):
        self.env = env
        self.n_simulations = n_simulations

    def search(self, state):
        root = MCTSNode(state)
        for _ in range(self.n_simulations):
            node = root
            # Simulate on a copy of the environment so the real one is untouched.
            sim_env = GridWorld(size=self.env.size)
            sim_env.grid = self.env.grid.copy()
            sim_env.agent_pos = state
            # Selection: descend through fully expanded nodes via UCT.
            while node.is_fully_expanded(sim_env.get_valid_actions(node.state)) and node.children:
                action, _ = node.best_child()
                node = node.children[action]
                sim_env.agent_pos = node.state
            # Expansion: try one untried action from this node.
            valid_actions = sim_env.get_valid_actions(node.state)
            if valid_actions and not node.is_fully_expanded(valid_actions):
                untried = [a for a in valid_actions if a not in node.children]
                action = random.choice(untried)
                next_state, _, _ = sim_env.step(action)
                child = MCTSNode(next_state, parent=node)
                node.children[action] = child
                node = child
            # Rollout: random playout for a bounded horizon (here 20 steps).
            total_reward = 0.0
            depth = 0
            while depth < 20:
                valid = sim_env.get_valid_actions(sim_env.agent_pos)
                if not valid:
                    break
                _, reward, done = sim_env.step(random.choice(valid))
                total_reward += reward
                depth += 1
                if done:
                    break
            # Backpropagation: propagate the rollout return up to the root.
            while node is not None:
                node.visits += 1
                node.value += total_reward
                node = node.parent
        if not root.children:
            valid = self.env.get_valid_actions(state)
            return random.choice(valid) if valid else 0
        # Act with the most-visited child, the standard robust MCTS choice.
        return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

We construct the Monte Carlo Tree Search (MCTS) agent to simulate and plan multiple potential future outcomes. We see how it builds a search tree, expands promising branches, and backpropagates results to refine decisions. This allows the agent to plan intelligently before acting. Check out the FULL CODES here.
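As a quick usage sketch (assuming the classes above and an illustrative 6×6 grid), we can run a single search from the start state and inspect the recommended move:

# One-off MCTS planning step (illustrative only).
env = GridWorld(size=6, n_obstacles=5)
state = env.reset()
mcts = MCTSAgent(env, n_simulations=50)
action = mcts.search(state)
print(f"From {state}, MCTS recommends action {action} (0=up, 1=down, 2=left, 3=right)")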

def train_agent(agent, env, episodes=500, max_steps=100, agent_type="standard"):
   rewards_history = []
   for episode in range(episodes):
       state = env.reset()
       total_reward = 0
       for step in range(max_steps):
           valid_actions = env.get_valid_actions(state)
           if agent_type == "mcts":
               action = agent.search(state)
           else:
               action = agent.get_action(state, valid_actions)
           next_state, reward, done = env.step(action)
           total_reward += reward
           if agent_type != "mcts":
               valid_next = env.get_valid_actions(next_state)
               agent.update(state, action, reward, next_state, valid_next)
           state = next_state
           if done:
               break
       rewards_history.append(total_reward)
       if hasattr(agent, 'decay_epsilon'):
           agent.decay_epsilon()
       if (episode + 1) % 100 == 0:
           avg_reward = np.mean(rewards_history[-100:])
           print(f"Episode {episode+1}/{episodes}, Avg Reward: {avg_reward:.2f}")
   return rewards_history


if __name__ == "__main__":
    print("=" * 70)
    print("Problem Solving via Exploration Agents Tutorial")
    print("=" * 70)
    env = GridWorld(size=8, n_obstacles=10)
    agents_config = {
        'Q-Learning (ε-greedy)': (QLearningAgent(), 'standard'),
        'UCB Agent': (UCBAgent(), 'standard'),
        'MCTS Agent': (MCTSAgent(env, n_simulations=30), 'mcts')
    }
    results = {}
    for name, (agent, agent_type) in agents_config.items():
        print(f"\nTraining {name}...")
        # Train every agent on the same env instance so the MCTS agent
        # plans on the very grid it is evaluated in.
        rewards = train_agent(agent, env, episodes=300, agent_type=agent_type)
        results[name] = rewards
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    for name, rewards in results.items():
        smoothed = np.convolve(rewards, np.ones(20) / 20, mode="valid")
        plt.plot(smoothed, label=name, linewidth=2)
    plt.xlabel('Episode')
    plt.ylabel('Reward (smoothed)')
    plt.title('Agent Performance Comparison')
    plt.legend()
    plt.grid(alpha=0.3)
    plt.subplot(1, 2, 2)
    for name, rewards in results.items():
        avg_last_100 = np.mean(rewards[-100:])
        plt.bar(name, avg_last_100, alpha=0.7)
    plt.ylabel('Average Reward (Last 100 Episodes)')
    plt.title('Final Performance')
    plt.xticks(rotation=15, ha="right")
    plt.grid(axis="y", alpha=0.3)
    plt.tight_layout()
    plt.show()
    print("=" * 70)
    print("Tutorial Complete!")
    print("Key Concepts Demonstrated:")
    print("1. Epsilon-Greedy exploration")
    print("2. UCB strategy")
    print("3. MCTS-based planning")
    print("=" * 70)

We train all three agents in our grid world and visualize their learning progress and performance. We analyze how each strategy (Q-Learning, UCB, and MCTS) adapts to the environment over time. Finally, we compare results and gain insights into which exploration approach leads to faster, more reliable problem-solving.
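As a follow-up, here is a hedged sketch of how one might evaluate the trained Q-Learning agent with exploration switched off (evaluate_greedy is a hypothetical helper, not part of the tutorial code):

# Greedy rollout of a trained Q-Learning agent (illustrative sketch).
def evaluate_greedy(agent, env, max_steps=100):
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        valid = env.get_valid_actions(state)
        if not valid:
            break
        # Always take the highest-valued action; no exploration at test time.
        action = max(valid, key=lambda a: agent.q_table[state][a])
        state, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total

# Example: evaluate_greedy(agents_config['Q-Learning (ε-greedy)'][0], env)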

In conclusion, we successfully implemented and compared three exploration-driven agents, each demonstrating a unique strategy for solving the same navigation challenge. We observe how epsilon-greedy enables gradual learning through randomness, UCB balances confidence with curiosity, and MCTS leverages simulated rollouts for foresight and planning. This exercise helps us appreciate how different exploration mechanisms influence convergence, adaptability, and efficiency in reinforcement learning.


Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
