During my studies, I took a quick look at Q-learning, an algorithm from the reinforcement learning family. Imagine an agent (or player) that interacts with its environment. The environment is set up as a grid, and the agent occupies one square at a time. Only vertical and horizontal moves are allowed, so as long as the agent is not at an edge, four move directions are possible[^1].

Each move is weighted with a reward (or punishment), typically between $-1$ and $1$. We assume our agent is greedy, so it will always make the best possible move - the move with the highest weight/reward.
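To make the setup concrete, here is a minimal TypeScript sketch of such a grid world. The grid size, reward values, and helper names are assumptions of mine for illustration, not taken from the original simulation:

```typescript
// Hypothetical 5x5 grid world: the agent sits on one field and may only
// move up, down, left, or right.
type Action = "up" | "down" | "left" | "right";
interface State { row: number; col: number }

const SIZE = 5;

// Reward for entering each field, between -1 and 1 (values chosen arbitrarily).
const rewards: number[][] = Array.from({ length: SIZE }, () =>
  Array.from({ length: SIZE }, () => -0.05) // small punishment per step
);
rewards[SIZE - 1][SIZE - 1] = 1; // the goal field carries the maximum reward

// Where a move leads to.
function nextState(s: State, a: Action): State {
  switch (a) {
    case "up":    return { row: s.row - 1, col: s.col };
    case "down":  return { row: s.row + 1, col: s.col };
    case "left":  return { row: s.row, col: s.col - 1 };
    case "right": return { row: s.row, col: s.col + 1 };
  }
}

// Near an edge, fewer than four directions are possible.
function possibleActions(s: State): Action[] {
  const all: Action[] = ["up", "down", "left", "right"];
  return all.filter((a) => {
    const n = nextState(s, a);
    return n.row >= 0 && n.row < SIZE && n.col >= 0 && n.col < SIZE;
  });
}

// The greedy agent always takes the move with the highest immediate reward.
function greedyMove(s: State): Action {
  return possibleActions(s).reduce((best, a) => {
    const nA = nextState(s, a);
    const nBest = nextState(s, best);
    return rewards[nA.row][nA.col] > rewards[nBest.row][nBest.col] ? a : best;
  });
}
```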

How can an agent learn to find the best route through such a grid?

The agent needs to explore its environment. Each move is based on the experience the agent has gathered before. A very naive approach puts the agent randomly on each field and weighs the possible actions. This is where Q-learning gets interesting[1:1]:

$$ \begin{equation}Q(s,a)=R(s,a)+\gamma \sum_{s'} P(s'|s,a) \max_{a'} Q(s',a') \end{equation}$$

$s$ is the current state of the agent, and $a$ is the action it takes from this state. The $R$ function tells the agent the reward of an action/move from a given state. $\gamma$ is the discount factor: a small $\gamma$ results in a preference for short-term rewards, while a large $\gamma$ also takes long-term rewards into account. $P(s'|s,a)$ describes the probability of reaching the new state $s'$. Since the simulation I built in the browser uses a deterministic environment, this term drops out and the equation simplifies a bit[1:2]:

$$ \begin{equation}Q(s,a)=R(s,a)+\gamma \max_{a'} Q(s',a') \end{equation}$$
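A direct translation of this deterministic backup could look like the following sketch, which builds on the grid helpers above; the Q-table layout and the key encoding are my own assumptions:

```typescript
// Discount factor and a simple Q-table, one entry per (state, action) pair.
const gamma = 0.5;
const qTable = new Map<string, number>();
const qKey = (s: State, a: Action) => `${s.row},${s.col},${a}`;
const getQ = (s: State, a: Action) => qTable.get(qKey(s, a)) ?? 0;

// Q(s,a) = R(s,a) + gamma * max_a' Q(s',a')   (deterministic environment)
function backup(s: State, a: Action): number {
  const sNext = nextState(s, a);
  const reward = rewards[sNext.row][sNext.col]; // R(s,a): reward of entering s'
  const bestNext = Math.max(
    ...possibleActions(sNext).map((aNext) => getQ(sNext, aNext))
  );
  return reward + gamma * bestNext;
}
```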

$s'$ is the next state and $a'$ the action taken from there, so $Q(s',a')$ is the value of that next step. The temporal difference update is the single step we are still missing:

$$ \begin{equation}Q(s,a)\leftarrow Q(s,a)+\alpha \left(R(s,a) + \gamma \cdot Q(s',a')-Q(s,a)\right)\end{equation} $$

where $\alpha$ is the learning rate. As we calculate the 'gain' of a single step, we no longer compare the best possible actions. The single-step update is used for the exploration phase. In our example, $\gamma$ is hardcoded to $0.5$. You can mark fields as obstacles, which makes it harder for the mouse to find a strategy. Feel free to play with it or to contribute:
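As a rough sketch of what a single exploration episode could look like, reusing the helpers from the sketches above: the agent wanders randomly and applies the temporal-difference update after every move. Here $a'$ is taken to be the action actually chosen in the next step (one possible reading of the update above), $\alpha$ is an arbitrary value of my choosing, and the real browser demo may handle action selection and episodes differently:

```typescript
const alpha = 0.1; // learning rate (arbitrary choice for this sketch)

function randomAction(s: State): Action {
  const actions = possibleActions(s);
  return actions[Math.floor(Math.random() * actions.length)];
}

// One exploration episode of a fixed number of steps.
function explore(start: State, steps: number): void {
  let s = start;
  let a = randomAction(s);
  for (let i = 0; i < steps; i++) {
    const sNext = nextState(s, a);
    const reward = rewards[sNext.row][sNext.col];
    const aNext = randomAction(sNext); // the action taken from the next state

    // Q(s,a) <- Q(s,a) + alpha * (R(s,a) + gamma * Q(s',a') - Q(s,a))
    const updated =
      getQ(s, a) + alpha * (reward + gamma * getQ(sNext, aNext) - getQ(s, a));
    qTable.set(qKey(s, a), updated);

    s = sNext;
    a = aNext;
  }
}

// e.g. let the mouse explore for a while, starting in the top-left corner:
explore({ row: 0, col: 0 }, 1000);
```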


  1. Russell, Stuart, and Peter Norvig. 2012. *Artificial Intelligence: A Modern Approach*. Pearson, pp. 973-975. ↩︎ ↩︎ ↩︎