
Solving Multi-Armed Bandits (MAB) problem via ε-greedy agents

In this article, we'll design a Multi-Armed Bandit problem (as described in Reinforcement Learning: An Introduction by Sutton & Barto [1]) and analyze how ε-greedy agents attempt to solve it.

(This is my implementation and revision of concepts from Chapter 2 of [1].)

What are Multi-Armed Bandits?

[Figure: A slot machine (a 1-armed bandit)]

Before we talk about MABs, we need to understand what a 1-armed bandit is.

A slot machine of the kind you see in a typical casino is another name for the 1-armed bandit. The "1-armed" comes from the lever/arm you pull to stop the spinning reels (nowadays a button does the job), and "bandit" because it loots you, duh!

These slot machines have an RNG (Random Number Generator) chip inside that generates a random combination of symbols. A gambler invests some amount in the machine and plays the game. A few rare combinations are rewarded while most are not, and all payouts are made while maintaining an average payback percentage in the long run.

Should I gamble on slots [2], [3], [4]?

  1. Let's assume the bandit has an average payback percentage of x%. The RNG chip is designed to maintain this average over the machine's lifetime; the total number of possible combinations and the winning combinations are chosen accordingly.
  2. The bandit returns x% of the total money invested in the machine and holds the remaining (100-x)%. That is the casino's cut. To maximize its profits, the casino draws more and more people into playing the game; the increase in investment results in increased profit.
  3. For example, if a slot machine has a 90% payback rate (a 10% hold rate), then a total investment of $100 on that machine nets the bandit a profit of only $10. But if the total investment is $1,000,000, the bandit ends up looting $100,000 either way.
  4. Naive gamblers lose money on bandits because they assume the machine samples without replacement, so they think that by playing the same machine many times in succession they significantly increase their chance of winning. They withdraw more and more money and fall prey to these ruthless bandits. Sadly, that is not how it works: on each play the probability of a win stays the same (the short simulation below illustrates this independence).
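
To make point 4 concrete, here is a minimal simulation sketch. The win probability and payout figures are made up for illustration; only the 90% payback rate is taken from the example above. Because every spin is independent, the long-run return per dollar stays at the payback rate no matter how long you keep playing.

```python
import numpy as np

rng = np.random.default_rng(0)

def spin(payback=0.90, win_prob=0.05):
    """One independent spin: win with probability win_prob, with the payout
    sized so that the long-run return per $1 bet equals the payback rate."""
    payout = payback / win_prob              # 0.90 / 0.05 = $18 per winning $1 bet
    return payout if rng.random() < win_prob else 0.0

bets = 1_000_000
total = sum(spin() for _ in range(bets))
print(f"return per $1 bet: {total / bets:.3f}")   # ~0.90 -- the casino keeps ~10%
```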

Now, MABs (or k-armed bandits) can be thought of as multiple slot machines with different payback percentages, or as a single slot machine with k arms, each arm having a different payback percentage. The aim of a player/agent is to come up with a policy (in this case a non-associative one) for choosing an action (here, which arm to pull) at each time step t such that the total reward at the end of some fixed time period T is maximized.

The MAB problem is a subclass of a more general class of stochastic scheduling problems [5] that deal with optimal resource allocation. It is also a single-state MDP (Markov Decision Process) problem.

Real-World Applications of MABs can be read here [6].

Designing the experiment

We are going to make some assumptions to simplify the experiment -

  1. We'll assume that the k-armed bandits are stationary. What this essentially means is that the payout rates do not change over time.
  2. We're going to consider the problem in a non-associative setting, i.e. actions are independent of state.
  3. Let the number of arms k = 10. The action here is deciding which arm to pull (0–9).
  4. Some notation: t - timestep, a - action (0–9), q*(a) - true value of action a (i.e. the average reward associated with a), Q_t(a) - estimated action-value of action a at timestep t, R_t - reward at time t, A_t - action taken at time t.

To simulate the workings of the MAB problem, [1] suggests sampling the true average action-value q*(a) for each arm (action) from a standard normal distribution N(0,1). An arm with a higher average action-value thus leads to better payouts in the long run. These values are hidden from the agents/players; to solve the problem, we have to figure them out by playing the bandit multiple times. What is available to the agents are the immediate rewards, which are sampled from the normal distribution N(q*(a),1) of the respective arm/action.
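
A minimal sketch of such a stationary testbed (this is not the exact environment installed in the Code section below; the class and method names are my own):

```python
import numpy as np

class KArmedBandit:
    """Stationary k-armed testbed: q*(a) ~ N(0, 1) is hidden from the agent,
    and pulling arm a returns a reward drawn from N(q*(a), 1)."""
    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)   # true action-values
        self.k = k

    def pull(self, a):
        return self.rng.normal(self.q_star[a], 1.0)       # immediate reward

    def optimal_action(self):
        return int(np.argmax(self.q_star))                # for evaluation only
```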

[Figure: Example reward distributions for the 10 arms; the white dots mark the respective q*(a) values]

What makes this problem non-trivial is that at any point in time we don't know the true average action-values q*(a). If we knew them, we could simply always choose the action a with the maximum q*(a). So we need ways to estimate q*(a) at each time step.

Solving the problem (Agents)

One way to solve the problem is via a Pure Greedy Action Selection Method. In this method, the agent always exploits its current knowledge (the Q_t(a) values) to maximize the immediate reward, as shown in the equation below.

A_t = argmax_a Q_t(a), ∀ a ∈ A    (1)

In action-value estimation methods, the aim is to estimate the true action-value of each action (i.e. to compute Q_t(a)) and to use these estimates to decide which action to take at a given time step via equation (1).
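
A small sketch of the greedy choice in equation (1) (breaking ties at random is my own addition; the article does not specify a tie-breaking rule):

```python
import numpy as np

def greedy_action(Q, rng=np.random.default_rng()):
    """Pure greedy selection, equation (1): A_t = argmax_a Q_t(a)."""
    best = np.flatnonzero(Q == Q.max())   # all maximizing actions
    return int(rng.choice(best))          # break ties at random
```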

Now, according to [1], for stationary k-armed bandit problems, action-value estimation can be done using the two methods described below:

  1. Sample Average Method (SAM): notice that this requires extra space to store past rewards for the average calculation at each time step.

Q_t(a) = (sum of rewards received when action a was taken prior to t) / (number of times action a was taken prior to t)

2. Incremental Update Method (IUM): a mathematical optimization of SAM. Notice how the update below is memory-efficient: you don't need to store the history of past rewards, and there is no need to recompute the average at each time step. (I've used this in my code; a sketch of both estimators follows this list.)

Q_{n+1} = Q_n + (1/n) * (R_n - Q_n), where Q_n is the estimate after n-1 rewards and R_n is the n-th reward for that action

  • A point to note is that all of the above methods are biased by the initial choice of action-values. This bias lets one encode prior information about the kinds of rewards to expect. For example, an optimistic bias (high initial action-values) encourages exploration early on and then lets it decrease. The downside is that the bias becomes a parameter the user has to set.
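
A minimal sketch of both estimators, as referenced above (the class names, the default initial value, and the optimistic-initialization parameter are my illustrative choices, not the reinforcei repo's API):

```python
import numpy as np

class SampleAverage:
    """SAM: store every reward and recompute the mean -- simple but memory-hungry."""
    def __init__(self, k=10):
        self.history = [[] for _ in range(k)]

    def update(self, a, reward):
        self.history[a].append(reward)

    def estimate(self, a):
        h = self.history[a]
        return float(np.mean(h)) if h else 0.0

class IncrementalAverage:
    """IUM: same estimates, O(1) memory, via Q_{n+1} = Q_n + (R_n - Q_n) / n."""
    def __init__(self, k=10, initial_q=0.0):       # initial_q > 0 gives the
        self.Q = np.full(k, float(initial_q))      # optimistic bias noted above
        self.N = np.zeros(k, dtype=int)

    def update(self, a, reward):
        self.N[a] += 1
        self.Q[a] += (reward - self.Q[a]) / self.N[a]
```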

The methods presented above create a dilemma. If you choose actions based solely on the action-value estimates, you end up with what we call a pure greedy agent (one that always exploits its current knowledge to maximize immediate reward). But what if there is another, untried arm that could have led to a better total reward in the end, yet never gets selected (explored) because of a lower initial action-value? This is where ε-greedy agents come in: they try to handle this dilemma.

[Figure: The simple ε-greedy bandit algorithm, from [1]]

As described in the figure above, the idea behind the simple ε-greedy bandit algorithm is to have the agent explore other actions uniformly at random with a small probability ε, while the rest of the time it selects the action greedily. It can be shown that, asymptotically, the estimated action-values converge to the true action-values, although this says little about the algorithm's practical effectiveness.
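
A sketch of the ε-greedy selection step (the function name and signature are mine; the estimate update itself is the incremental rule shown earlier):

```python
import numpy as np

def epsilon_greedy_action(Q, epsilon, rng=np.random.default_rng()):
    """With probability epsilon pick a random arm (explore),
    otherwise pick the arm with the highest current estimate (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))         # explore
    best = np.flatnonzero(Q == Q.max())
    return int(rng.choice(best))                 # exploit, ties broken at random
```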

Another way to address the exploration-vs-exploitation dilemma is the Upper Confidence Bound (UCB) action-selection method. In this approach, instead of selecting randomly from the available action space when exploring, UCB takes into account the uncertainty in each estimate, with a constant c > 0 determining the confidence level of the estimate and thus the degree of exploration. In practice, however, this method has not been found to scale well beyond the MAB problem.

A_t = argmax_a [ Q_t(a) + c * sqrt( ln(t) / N_t(a) ) ], where N_t(a) is the number of times action a has been selected prior to time t
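
A sketch of UCB selection under the equation above (untried arms are treated as maximizing, as in [1]; the default value of c is illustrative):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])                   # untried arms count as maximizing
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```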

Code

  1. Install the k-armed bandits environment.
  2. Head over to reinforcei. This repo contains the code for the agents and their respective demos.

Solution Analysis

[Plot: Performance comparison of the agents on the 10-armed testbed]

The plot above compares the performance of different agents on the same 10-armed bandit problem. The statistics are obtained by averaging over 1000 such runs.
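
A sketch of that evaluation protocol (this reuses the KArmedBandit sketch from earlier and assumes a small agent interface with act and update methods; it is not the repo's actual driver):

```python
import numpy as np

def average_over_runs(make_agent, runs=1000, steps=1000, k=10, seed=0):
    """Average an agent's reward curve over many independent bandit problems."""
    avg_reward = np.zeros(steps)
    for r in range(runs):
        bandit = KArmedBandit(k=k, seed=seed + r)    # fresh hidden q*(a) each run
        agent = make_agent(k)
        for t in range(steps):
            a = agent.act(t)                         # choose an arm
            reward = bandit.pull(a)
            agent.update(a, reward)                  # update the estimates
            avg_reward[t] += reward
    return avg_reward / runs
```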

Some other inferences hold asymptotically (over more episodes) or under different experimental settings, and so might not be visible here:

  1. ε = 0.01 is bound to perform better than ε = 0.1 in the long run, and both perform better than ε = 0.
  2. Instead of keeping ε static, designing ε to decay over time leads to better performance (a tiny decay-schedule sketch follows this list).
  3. If rewards are noisier, i.e. the variance of the sampling distributions is increased, ε-greedy agents do far better than pure-greedy agents. If instead we reduce the variance to 0, pure greedy is the way to go, since at each time step the estimated Q(a) equals the true q*(a).
  4. High initial biases (optimistic initial values) are useful to encourage exploration in the early stages of stationary MAB problems.
  5. UCB performs better than ε-greedy on the stationary MAB problem.
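
For point 2, one possible decay schedule (a sketch; the schedule and numbers are illustrative, not from the article):

```python
def decaying_epsilon(t, eps0=1.0, eps_min=0.01, decay=1e-3):
    """Start exploratory and settle towards (mostly) greedy behaviour over time."""
    return max(eps_min, eps0 / (1.0 + decay * t))
```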

Side Note

Thank you for reading the article. I hope it was useful. Any corrections or advice are welcome. I hope to improve my understanding of RL from the ground up. I plan to cover non-stationary MABs and the gradient bandit algorithm in the next article.

References

[1] Sutton & Barto, Reinforcement Learning: An Introduction (2nd edition), https://mitpress.mit.edu/books/reinforcement-learning-second-edition : all the algorithms, equations, and derivations (except the code) are from this book.

[2] https://youtu.be/7Wkubf1PrWg

[3] https://youtu.be/BFlRH99TQOw

[4] https://youtu.be/4wzg-8QKC5s

[5] https://en.wikipedia.org/wiki/Stochastic_scheduling

[6] https://www.researchgate.net/publication/332604222_A_Survey_on_Practical_Applications_of_Multi-Armed_and_Contextual_Bandits

