Online Advertisement Placement with Contextual Bandit.
There is a class of algorithms called Multi Arm Bandit (MAB) applicable to Reinforcement Learning problems without state. Sometimes there is side information available for each action. MAB algorithms that take this side information, aka context, into account are called Contextual Bandits. The decision made by a Contextual Bandit at any given time is a function of the context in addition to past rewards. In this post we will use a Contextual Bandit algorithm called Linear Payoff Upper Confidence Bound (LinUCB) to make online ad placement decisions.
The implementation is available in my Reinforcement Learning Python package called qinisa. Here is the GitHub repo.
Contextual Bandit
In Reinforcement Learning (RL), an agent takes an action in a given state of the environment; the agent later gets a reward from the environment and the state changes. The goal of RL is to select actions in such a way as to maximize the long term cumulative reward. Multi Arm Bandit (MAB) is applicable to a class of RL problems without state. In Contextual Multi Arm Bandit there is a context based on the agent and the environment. The context is not exactly the same as state in RL.
The contextual bandit is a continuous decision making process under uncertain conditions. Its goal, as in any Reinforcement Learning system, is to maximize the reward over a time horizon. It runs in a loop as follows, making decisions when asked to and getting rewards for past decisions taken.
- Pass the context for each action to the model and get the action selected by the model. The action is taken.
- When the reward for some past action becomes available from the environment, pass the action and reward to the model. The model updates itself based on the reward feedback.
The context is dynamic. It could be the same for all actions or different for each action, and it could change with time. The context is generally represented as a vector.
There are many contextual bandit algorithms. Here are some examples:
- Linear Upper Confidence Bound
- Thompson Sampling with Linear Payoff
- Epoch Greedy
In LinUCB, the expected reward for an action is assumed to be a linear function of the feature vector for the action. Based on the actual rewards received, a linear regression problem is solved, using a Ridge aka L2 regularizer. LinUCB also calculates an upper confidence bound for the reward of each action and selects the action with the highest estimated upper confidence bound.
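A minimal sketch of the disjoint LinUCB algorithm may help make this concrete. This is not the qinisa implementation; the class and method names here are illustrative.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge regression model per action (arm)."""

    def __init__(self, arms, num_features, alpha=1.0):
        self.alpha = alpha  # exploration strength
        # A accumulates I + sum(x x^T) (the ridge Gram matrix), b accumulates sum(reward * x)
        self.A = {a: np.eye(num_features) for a in arms}
        self.b = {a: np.zeros(num_features) for a in arms}

    def select(self, contexts):
        """contexts: dict mapping arm -> feature vector. Returns the arm with the highest UCB."""
        best_arm, best_ucb = None, float("-inf")
        for arm, x in contexts.items():
            x = np.asarray(x, dtype=float)
            A_inv = np.linalg.inv(self.A[arm])
            theta = A_inv @ self.b[arm]  # ridge estimate of the reward weights
            # estimated reward plus confidence width; wide bounds get explored
            ucb = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            if ucb > best_ucb:
                best_arm, best_ucb = arm, ucb
        return best_arm

    def update(self, arm, x, reward):
        """Fold the observed reward for the played arm back into its model."""
        x = np.asarray(x, dtype=float)
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

The key point is that the same exploration parameter alpha trades off trying uncertain actions against exploiting the current reward estimates.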
Advertisement Placement
In online advertising, there is a platform through which advertisers and content publishers collaborate. The crux of the problem is matching content with advertisements to maximize click through. There is a set of advertisements and a set of pages in various websites where the advertisements are placed based on contracts between advertisers and content publishers. When a page is about to come up in response to a user request, a decision needs to be made about the particular advertisement to be placed on the page. The advertisements are the actions for our problem.
The context for an advertisement is calculated based on the similarity between the advertisement and the page along 4 dimensions. The four similarity parameters for matching are as follows, so the feature vector size is 4 for our example problem. Each similarity value is between 0 and 1, with 1 being maximum similarity. The context vector for this problem is different for each action and it also varies with time.
- Content
- Time of day
- Device type
- Geo location
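As a concrete illustration (the values here are made up), the context vector for one advertisement and page pair could be built from the four similarities like this:

```python
# hypothetical similarity scores for one (advertisement, page) pair;
# each value is in [0, 1], with 1 meaning maximum similarity
similarities = {
    "content": 0.97,       # how well the ad topic matches the page content
    "time_of_day": 0.84,   # how well the ad targets the current time window
    "device_type": 0.67,   # how well the ad suits the user's device
    "geo_location": 0.64,  # how well the ad targets the user's region
}

# the 4-element feature vector passed to the contextual bandit for this action
feature_vector = [similarities[k] for k in
                  ("content", "time_of_day", "device_type", "geo_location")]
```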
Advertisement placement optimization is simulated in a loop as below. Please take a look at adpl.py for details. The bandit model starts with a clean slate and learns incrementally as it receives reward feedback. Learning is online.
- Choose a site randomly
- Take the feature vectors for the combination of that site and all advertisements. A feature vector for an advertisement and site combination consists of 4 similarity values
- Change the geo location similarity randomly. You could also change the device type similarity. The time of day similarity could also be changed randomly every few hours
- Pass the feature vectors of all advertisements to the contextual bandit model. It returns the advertisement to run. That's the decision made by the LinUCB model.
- Simulate the reward for that advertisement and site combination. Pass the reward to the contextual bandit model. The model updates itself based on the reward feedback. This is where learning takes place
- At the end of a session you have the option to checkpoint the model and later restore the checkpointed model
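The loop above can be sketched as follows. This is a toy stand-in, not adpl.py: the site and ad names, the reward formula, and the `MeanRewardModel` stub (used here instead of LinUCB for brevity) are all illustrative.

```python
import random

random.seed(42)

SITES = [f"site_{i}" for i in range(1, 9)]
ADS = [f"adv_{i}" for i in range(1, 6)]
DIMS = 4  # content, time of day, device type, geo location

# fixed base similarities per (site, ad) pair; each value in [0, 1]
base = {(s, a): [random.random() for _ in range(DIMS)] for s in SITES for a in ADS}

class MeanRewardModel:
    """Trivial stand-in for the bandit model (the post uses LinUCB):
    ignores context, tracks mean reward per ad, picks the best so far."""
    def __init__(self):
        self.total = {a: 0.0 for a in ADS}
        self.count = {a: 0 for a in ADS}

    def select(self, contexts):
        for a in ADS:                 # explore each ad at least once
            if self.count[a] == 0:
                return a
        return max(ADS, key=lambda a: self.total[a] / self.count[a])

    def update(self, ad, reward):
        self.total[ad] += reward
        self.count[ad] += 1

model = MeanRewardModel()
for step in range(200):
    site = random.choice(SITES)                   # 1. choose a site randomly
    contexts = {}
    for ad in ADS:
        f = list(base[(site, ad)])                # 2. feature vector for site + ad
        f[3] = random.random()                    # 3. perturb geo location similarity
        contexts[ad] = f
    ad = model.select(contexts)                   # 4. model decides which ad to run
    # 5. simulated reward in [0, 1]; the model learns from this feedback
    reward = 0.6 * sum(contexts[ad]) / DIMS + random.uniform(0.0, 0.4)
    model.update(ad, reward)
```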
Please take a look at the tutorial document for details on how to run it. The implementation of LinUCB is on GitHub in case you are curious.
Here is some console output. Each row shows the advertisement selected by the LinUCB model for a given site along with the actual reward obtained.
site site_1 adv adv_5 features 0.981 0.700 0.475 0.464 reward 0.560
site site_6 adv adv_2 features 0.594 0.914 0.461 0.779 reward 0.620
site site_3 adv adv_4 features 0.976 0.813 0.460 0.857 reward 0.802
site site_8 adv adv_1 features 0.973 0.843 0.667 0.645 reward 0.609
You can get more detailed output in the log. Here is a sample. It shows the expected reward upper bound, based on which the model selects the action.
2023-07-26 12:12:04,965 - qinisa.cmab - INFO - play count 174 action adv_5 reward upper bound 2.034
2023-07-26 12:12:04,965 - qinisa.cmab - INFO - action adv_5 feature 0.894 0.691 0.460 0.707 actual reward 0.697
2023-07-26 12:12:04,966 - qinisa.cmab - INFO - play count 175 action adv_4 reward upper bound 0.463
2023-07-26 12:12:04,966 - qinisa.cmab - INFO - action adv_4 feature 0.750 0.599 0.678 0.634 actual reward 0.694
2023-07-26 12:12:04,966 - qinisa.cmab - INFO - play count 176 action adv_3 reward upper bound 2.722
2023-07-26 12:12:04,966 - qinisa.cmab - INFO - action adv_3 feature 0.892 0.912 0.425 0.421 actual reward 0.579
2023-07-26 12:12:04,967 - qinisa.cmab - INFO - play count 177 action adv_2 reward upper bound 1.538
2023-07-26 12:12:04,967 - qinisa.cmab - INFO - action adv_2 feature 0.594 0.914 0.461 0.806 actual reward 0.670
2023-07-26 12:12:04,967 - qinisa.cmab - INFO - play count 178 action adv_5 reward upper bound 2.159
2023-07-26 12:12:04,967 - qinisa.cmab - INFO - action adv_5 feature 0.894 0.691 0.460 0.474 actual reward 0.552
2023-07-26 12:12:04,968 - qinisa.cmab - INFO - play count 179 action adv_5 reward upper bound 2.365
2023-07-26 12:12:04,968 - qinisa.cmab - INFO - action adv_5 feature 0.981 0.700 0.475 0.835 actual reward 0.784
Concluding Thoughts
Contextual Bandit is a powerful RL technique suitable for many applications such as recommendation, search and advertisement.