make smart choices when it starts out knowing nothing and can only learn through trial and error?
This is exactly what one of the simplest but most important models in reinforcement learning is all about:
A multi-armed bandit is a simple model for learning by trial and error.
Just like we do.
We’ll explore why the decision between trying something new (exploration) and sticking to what works (exploitation) is trickier than it seems. And what this has to do with AI, online ads and A/B testing.
Why is it important to understand this concept?
The multi-armed bandit introduces one of the core dilemmas of reinforcement learning: How to make good decisions under uncertainty.
It is not only relevant for AI, data science and behavioral models; it also reflects how we humans learn through trial and error.
What machines learn by trial and error is not so different from what we humans do intuitively.
The difference?
Machines do it in a mathematically optimized way.
Let’s imagine a simple example:
We are standing in front of a slot machine. This machine has 10 arms and each of these arms has an unknown chance of winning.
Some levers give higher rewards, others lower ones.
We can pull the levers as often as we like, but our goal is to win as much as possible.
This means that we have to find out which arm is the best (= yields the most profit) without knowing from the start which one it is.
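To make this concrete, here is a minimal Python sketch of such a 10-armed bandit. The win probabilities are made-up values for illustration; the agent never gets to see them and only observes the rewards from pulling.

```python
import numpy as np

# A minimal sketch of the 10-armed bandit described above.
# The true win probabilities are made-up values: the agent never sees them,
# it only observes the rewards it gets from pulling.
rng = np.random.default_rng(seed=42)

n_arms = 10
true_win_probs = rng.uniform(0.1, 0.9, size=n_arms)  # hidden from the agent

def pull(arm: int) -> float:
    """Pull one arm: reward 1.0 with that arm's win probability, else 0.0."""
    return float(rng.random() < true_win_probs[arm])

# We may pull as often as we like, e.g. five test pulls on arm 3:
print([pull(3) for _ in range(5)])
```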
The model is very reminiscent of what we often experience in everyday life:
We test out different strategies. At some point, we use the one that brings us the most pleasure, enjoyment, money, etc. Whatever it is that we are aiming for.
In behavioral psychology, we speak of trial-and-error learning.
Or we can also think of reward learning in cognitive psychology: Animals in a laboratory experiment find out over time at which lever there is food because they get the greatest gain at that particular lever.
Now back to the concept of multi-armed bandits:
It serves as an introduction to decision-making under uncertainty and is a cornerstone for understanding reinforcement learning.
I wrote about reinforcement learning (RL) in detail in my last article, “Reinforcement Learning Made Simple: Build a Q-Learning Agent in Python”. At its core, RL is a subfield of machine learning in which an agent learns to make good decisions through trial and error: The agent finds itself in an environment, decides on certain actions and receives rewards or penalties for them. The agent's goal is to develop a strategy (policy) that maximizes the long-term overall reward.
So we have to find out in the multi-armed bandits:
- Which levers are worthwhile in the long term?
- When should we exploit a lever further (exploitation)?
- When should we try out a new lever (exploration)?
These last two questions lead us directly to the central dilemma of reinforcement learning:
Central dilemma in Reinforcement Learning: Exploration vs. Exploitation
Have you ever held on to a good option? Only to find out later that there’s a better one? That’s exploitation winning over exploration.
This is the core problem of learning through experience:
- Exploration: We try something new in order to learn more. Maybe we discover something better. Or maybe not.
- Exploitation: We use the best of what we have learned so far. With the aim of gaining as much reward as possible.
The problem with this?
We never know for sure whether we have already found the best option.
Choosing the arm with the highest reward so far means relying on what we know. This is called exploitation. However, if we commit too early to a seemingly good arm, we may overlook an even better option.
Trying a different or rarely used arm gives us new information. We gain more knowledge. This is exploration. We might find a better option. But it could also be that we find a worse option.
That is the dilemma at the heart of reinforcement learning.

What we can conclude from this:
If we exploit too early, we may miss out on the better arms (here, arm 3 instead of arm 1). However, too much exploration also reduces the overall reward (if we already know that arm 1 is good).
Let me explain the same thing again in non-techy language (but somewhat simplified):
Let’s imagine we know a good restaurant. We’ve gone to the same restaurant for 10 years because we like it. But what if there is a better, cheaper place just around the corner? And we have never tried it? If we never try something new, we’ll never find out.
Interestingly, this isn’t just a problem in AI. It is well known in psychology and economics too:
The exploration vs. exploitation dilemma is a prime example of decision-making under uncertainty.
The psychologist and Nobel Prize winner Daniel Kahneman and his colleague Amos Tversky showed that people often do not make rational decisions when faced with uncertainty. Instead, we follow heuristics, i.e. mental shortcuts.
These shortcuts often reflect either habit (=exploitation) or curiosity (=exploration). It is precisely this dynamic that is also visible in the Multi-Armed Bandit:
- Do we play it safe (= the known arm with a high reward)?
- Or do we risk something new (= a new arm with an unknown reward)?
Why does this matter for reinforcement learning?
We face the dilemma between exploration vs. exploitation everywhere in reinforcement learning (RL).
An RL agent must constantly decide whether it should stick with what has worked best so far (=exploitation) or should try something new to discover even better strategies (=exploration).
You can see this trade-off in action in recommendation systems: Should we keep showing users content they already like or risk suggesting something new they might love?
And what strategies are there to select the best arm? Action selection strategies
Action selection strategies determine how an agent decides which arm to select in the next step. In other words, how an agent deals with the exploration vs. exploitation dilemma.
Each of the following strategies (also policies/rules) answers one simple question: How do we choose the next action when we don’t know for sure what’s best?
Strategy 1 – Greedy
This is the simplest strategy: We always choose the arm with the highest estimated reward (= the highest Q(a)). In other words, always go for what seems best right now.
The advantage of this strategy is that the reward is maximized in the short term and that the strategy is very simple.
The disadvantage is that there is no exploration. No risk is taken to try something new, because the current best always wins. The agent might miss better options that simply haven't been discovered yet.
The formal rule is as follows:
At = argmaxₐ Qt(a)
Let’s have a look at a simplified example:
Imagine we try two new pizzerias, and the second one is quite good. From then on, we only go back to that one, even though there are six more we’ve never tried. Maybe we’re missing out on the best pizza in town. But we’ll never know.
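As a minimal sketch (with made-up estimates Q(a)), the greedy rule boils down to a single argmax:

```python
import numpy as np

# Greedy selection sketch: Q holds the current reward estimate for each arm
# (the numbers here are illustrative, not measured data).
Q = np.array([0.2, 0.8, 0.5, 0.1, 0.4, 0.0, 0.3, 0.6, 0.2, 0.7])

greedy_arm = int(np.argmax(Q))  # always pick the arm that currently looks best
print(greedy_arm)               # 1, because Q[1] = 0.8 is the highest estimate
```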
Strategy 2 – ε-Greedy:
Instead of always picking the best-known option, this strategy allows some randomness:
- With probability ε, we explore (try something new).
- With probability 1-ε, we exploit (stick with the current best).
This strategy deliberately mixes chance into the decision and is therefore practical and often effective.
- The higher ε is chosen, the more exploration happens.
- The lower ε is chosen, the more we exploit what we already know.
For example, if ε = 0.1, exploration occurs in 10% of cases, while exploitation occurs in 90% of cases.
The advantage of ε-Greedy is that it is easy to implement and provides good basic performance.
The disadvantage is that choosing the right ε is difficult: If ε is chosen too large, a lot of exploration takes place and the loss of rewards can be too great. If ε is too small, there is little exploration.
If we stay with the pizza example:
We roll a die before every restaurant visit. If we get a 6, we try out a new pizzeria. If not, we go to our regular pizzeria.
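In code, the whole strategy is essentially one if-statement. Here is a small sketch with illustrative Q values and ε = 0.1:

```python
import numpy as np

# ε-greedy selection sketch (the Q values are illustrative).
rng = np.random.default_rng()

Q = np.array([0.2, 0.8, 0.5, 0.1, 0.4, 0.0, 0.3, 0.6, 0.2, 0.7])
epsilon = 0.1  # explore in roughly 10% of decisions

if rng.random() < epsilon:
    action = int(rng.integers(len(Q)))  # exploration: try a random arm
else:
    action = int(np.argmax(Q))          # exploitation: stick with the current best
```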
Strategy 3 – Optimistic Initial Values:
The idea behind this strategy is that all initial estimates Q0(a) start with artificially high values (e.g. 5.0 instead of 0.0). At the beginning, the agent assumes all options are great.
This encourages the agent to try everything (exploration). It wants to disprove the high initial value. As soon as an action has been tried, the agent sees that it is worth less and adjusts the estimate downwards.
The advantage of this strategy is that exploration occurs automatically. This is particularly suitable in deterministic environments where rewards do not change.
The disadvantage is that the trick only works if the starting values are genuinely optimistic: if the true rewards are already close to the initial estimate, hardly any extra exploration happens.
If we look at the restaurant example again, we would rate each new restaurant with 5 stars at the beginning. As we try them, we adjust the ratings based on real experience.
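A minimal sketch of this idea, assuming rewards between 0 and 1 so that a start value of 5.0 is genuinely optimistic (here with a simple sample-average update):

```python
import numpy as np

# Optimistic initial values sketch: every estimate starts artificially high,
# and real rewards pull the tried arms back down.
n_arms = 10
Q = np.full(n_arms, 5.0)   # every arm initially looks great
N = np.zeros(n_arms)       # how often each arm has been tried

def update(arm: int, reward: float) -> None:
    """Sample-average update: a tried arm quickly loses its optimistic bonus."""
    N[arm] += 1
    Q[arm] += (reward - Q[arm]) / N[arm]

update(arm=2, reward=0.6)
print(Q[2])  # drops from 5.0 to 0.6; untried arms still look better,
             # so a greedy agent automatically explores them next
```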
To put it simply, Greedy is pure habitual behavior. ε-Greedy is a mixture of habit and curiosity behavior. Optimistic Initial Values is comparable to when a child initially thinks every new toy is great – until it has tried it out.
On my Substack Data Science Espresso, I regularly share practical guides and bite-sized updates from the world of Data Science, Python, AI, Machine Learning and Tech — made for curious minds like yours. Have a look — and subscribe if you want to stay in the loop.
How the agent learns which options are worthwhile: Estimating Q-values
For an agent to make good decisions, it must estimate how good each individual arm is. It needs to find out which arm will bring the highest reward in the long term.
However, the agent does not know the true reward distribution.
This means the agent must estimate the average reward of each arm based on experience. The more often an arm is drawn, the more reliable this estimate becomes.
We use an estimated value Q(a) for this:
Q(a) ≈ expected reward if we choose arm a
Our aim is for the estimated value Qt(a) to get better and better, until it comes as close as possible to the true value q∗(a):
Qt(a) → q∗(a) (the more often arm a is chosen, the closer the estimate gets to the true value)
The agent wants to learn from its experience in such a way that its estimate Qt(a) eventually matches the true average reward of arm a.
Let’s look again at our simple restaurant example:
We imagine that we want to find out how good a particular café is. Every time we go there, we get some feedback by giving it 3, 4 or 5 stars, for example. Our goal is that the perceived average will eventually match the real average that we would get if we went infinitely often.
There are two basic ways in which an agent calculates this Q value:

Method 1 – Sample average method
This method calculates the average of the observed rewards and is actually as simple as it sounds.
All previous rewards for this arm are looked at and the average is calculated.
Qn(a) = (R1 + R2 + … + Rn) / n
- n: Number of times arm a was chosen
- Ri: Reward on the i-th time
The advantage of this method is that it is simple and intuitive. And it is statistically correct for stable, stationary problems.
The disadvantage is that it reacts too slowly to changes. Especially in non-stationary environments, where conditions shift over time.
For example, imagine a music recommendation system: A user might suddenly develop a new taste. The user used to prefer rock, but now they listen to jazz. If the system keeps averaging over all past preferences, it reacts very slowly to this change.
Similarly, in the multi-armed bandit setting, if arm 3 suddenly starts giving much better rewards from round 100 onwards, the running average will be too sluggish to reflect that. The early data still dominates and hides the improvement.
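As a small sketch, the sample average for one arm is simply the mean over all rewards observed for it (the reward values below are made up):

```python
import numpy as np

# Sample-average estimate for a single arm.
rewards_for_arm = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]  # illustrative observations

Q = np.mean(rewards_for_arm)   # Qn(a) = (R1 + R2 + ... + Rn) / n
print(round(Q, 2))             # 0.67 -> roughly a 67% win rate so far
```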
Method 2 – Incremental Implementation
Here the Q value is adjusted immediately with each new reward – without saving all previous data:
Qn+1(a) = Qn(a) + α · (Rn − Qn(a))
- α: Learning rate (0 < α ≤ 1)
- Rn: Newly observed reward
- Qn(a): Previous estimated value
- Qn+1(a): Updated estimated value
If the environment is stable and rewards don’t change, the sample average method works best. But if things change over time, the incremental method with a constant learning rate α adapts more quickly.
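Here is a minimal sketch of the incremental update with a constant learning rate, using an illustrative reward stream in which the arm only starts paying off later:

```python
# Incremental update sketch with a constant learning rate.
alpha = 0.1  # illustrative value
Q = 0.0      # current estimate for one arm

for R in [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]:  # the arm starts paying off later on
    Q = Q + alpha * (R - Q)               # Qn+1(a) = Qn(a) + alpha * (Rn - Qn(a))

print(round(Q, 3))  # 0.344: the estimate follows the change, weighting recent rewards more
```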

Final Thoughts: What do we need it for?
Multi-armed bandits are the basis for many real-world applications such as recommendation engines or online advertising.
At the same time, it’s the perfect stepping stone into reinforcement learning. It teaches us the mindset: Learning through feedback, acting under uncertainty and balancing exploration and exploitation.
Technically, multi-armed bandits are a simplified form of Reinforcement Learning: There are no states, no future planning, but only the rewards right now. But the logic behind them shows up again and again in advanced methods like Q-learning, policy gradients, and deep reinforcement learning.
Curious to go further?
On my Substack Data Science Espresso, I share guides like this one. Breaking down complex AI topics into digestible, practicable steps. If you enjoyed this, subscribe here to stay in the loop.