
Gig Economy Workforce Scheduling with Reinforcement Learning

source link: https://pkghosh.wordpress.com/2022/03/28/gig-economy-workforce-scheduling-with-reinforcement-learning/


Gig economy workers typically work on a contract basis, potentially temporarily, and are called in to work as needed. Some examples are delivery services, app based taxi services, content creation and low level administrative work. A company may have a pool of gig workers. On a given day, based on a demand forecast, it might need a certain number of workers. How does it decide which workers to call from the pool so that the choice is most beneficial to the company? It is a complex decision making problem. In this post we will find out how a type of Reinforcement Learning (RL) called Multi Arm Bandit (MAB) can effectively solve it.

The Python implementation is available in my OSS GitHub repository avenir. The use case for the solution is a fictitious food delivery service.

Reinforcement Learning

Reinforcement Learning is similar to supervised learning, except that it applies to decision making in an uncertain environment and, instead of a label, you get a numerical feedback signal, which is often delayed. It is characterized by the following

  • State of the system
  • An action taken in some state
  • Following the action, a reward is received from the environment after some time and the state changes.

Multi Arm Bandit is a simpler version of RL where no state is involved. There is only action and reward. This old post of mine has a good review of various MAB algorithms. For the use case in this post we are using the UCB1 MAB algorithm. In the score function for UCB1, the first term corresponds to exploitation and the second to exploration.
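For reference, the standard UCB1 score for an action i, after n total trials of which n_i were for action i with mean reward \bar{r}_i, is

s_i = \bar{r}_i + \sqrt{\frac{2 \ln n}{n_i}}

The first term favors actions with a high observed mean reward (exploitation), while the second term grows for actions that have been tried relatively rarely (exploration).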

MAB algorithms, like any RL algorithms, engage in exploration, i.e. trying actions that have not been selected much in order to find rewarding ones, and exploitation, i.e. leveraging actions that have already been found to be more rewarding. Performance is measured by long term regret, which is the cumulative loss for not choosing the optimal action at each step. The training of these models can be offline followed by online, or online only.
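In its usual form, the cumulative regret after T steps compares the expected reward \mu^* of the best action against the expected rewards of the actions a_t actually chosen:

R_T = T \mu^* - \sum_{t=1}^{T} \mu_{a_t}

A good MAB algorithm keeps this quantity growing slowly, for example logarithmically in T in the case of UCB1.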

The cycle of using a MAB model, repeated continuously, goes as follows; a minimal code sketch of this loop follows the list. Because the model runs continuously and learns along the way, its decisions get better as time goes on, and it can handle any non stationarity in the environment. Each action is independent, and the reward distribution for each action is different and not necessarily stationary.

  • Query the model for the next decision
  • Take the necessary action
  • When a reward is received for the action, feed it to the model. This allows the model to learn.
  • Go back and repeat
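Here is a minimal, self contained sketch of this cycle in Python, assuming a toy UCB1 implementation; the class and method names are made up for illustration and are not the avenir API.

import math
import random

class SimpleUcb1:
    # minimal UCB1 bandit for illustration; not the avenir implementation
    def __init__(self, actions):
        self.counts = {a: 0 for a in actions}    # how many times each action was taken
        self.means = {a: 0.0 for a in actions}   # running mean reward per action
        self.total = 0                           # total number of rewarded trials

    def next_action(self):
        # take any action that has not been tried yet
        for action, count in self.counts.items():
            if count == 0:
                return action
        # otherwise pick the action with the highest UCB1 score
        def score(a):
            exploit = self.means[a]
            explore = math.sqrt(2.0 * math.log(self.total) / self.counts[a])
            return exploit + explore
        return max(self.counts, key=score)

    def set_reward(self, action, reward):
        # incremental update of the running mean for this action
        self.counts[action] += 1
        self.total += 1
        self.means[action] += (reward - self.means[action]) / self.counts[action]

# the repeated query -> act -> reward cycle
model = SimpleUcb1(["workerA", "workerB", "workerC"])
for step in range(100):
    action = model.next_action()           # query the model for the next decision
    reward = random.betavariate(2.0, 5.0)  # stand-in for the reward observed later
    model.set_reward(action, reward)       # feed the reward back so the model learns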

Workforce Scheduling

The company has a demand forecasting model and, using it, knows how many workers it is going to hire on a given day. Every day in the morning, it uses the MAB model to recommend which workers to hire. One key challenge in modeling this kind of problem is defining an appropriate reward function. For our use case the reward depends on the following

  • How the worker responds to the message asking them to come to work. They may respond positively or negatively, or may not respond at all
  • The income on that day.
  • The average rating of the worker from customers.

I have given more weight to the first and the third parameters, because a worker may not have much control over how much they earn on a given day.
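As an illustration, a reward function along these lines could combine the three factors as below; the weights, the income cap and the response encoding are assumptions for this sketch, not the values used in the repository.

def worker_reward(response, day_income, avg_rating,
                  max_income=300.0, weights=(0.4, 0.2, 0.4)):
    # combine the three factors into a reward in [0, 1]
    # response: 1.0 positive, 0.5 no response, 0.0 negative (assumed encoding)
    # day_income: earnings for the day, capped at max_income
    # avg_rating: average customer rating on an assumed 1 to 5 scale
    w_resp, w_income, w_rating = weights
    income_score = min(day_income, max_income) / max_income
    rating_score = (avg_rating - 1.0) / 4.0   # map 1..5 to 0..1
    return w_resp * response + w_income * income_score + w_rating * rating_score

# example: positive response, 180 earned, 4.2 average rating
print(worker_reward(1.0, 180.0, 4.2))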

The business performs the following actions every day for model driven workforce scheduling (a code sketch follows the list)

  • In the early morning, query the model to select all the workers to be called to work on that day
  • At the end of the day, calculate reward for each worker and provide that information to the model
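Putting the two daily operations together, here is a rough sketch in which each worker is an arm and several arms are selected per day; the state layout and function names are made up for illustration, and selecting multiple workers per round is a slight departure from the classic single arm setting.

import math

# illustrative per-worker arm state: selection count and running mean reward
arms = {w: {"count": 0, "mean": 0.0}
        for w in ["A3G24LR26L", "9QHC8TU48K", "18G3ZIUCMM", "J70AUORYM8"]}
total = 0

def schedule(num_needed):
    # morning: rank workers by UCB1 score and call the top num_needed
    def ucb(w):
        arm = arms[w]
        if arm["count"] == 0:
            return float("inf")   # make sure every worker gets tried at least once
        return arm["mean"] + math.sqrt(2.0 * math.log(total) / arm["count"])
    return sorted(arms, key=ucb, reverse=True)[:num_needed]

def process(day_rewards):
    # end of day: feed each called worker's reward back into its arm
    global total
    for w, r in day_rewards.items():
        arm = arms[w]
        arm["count"] += 1
        total += 1
        arm["mean"] += (r - arm["mean"]) / arm["count"]

called = schedule(2)                # num_needed would come from the demand forecast
process({w: 0.7 for w in called})   # stand-in for the computed end of day rewards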

Implementation and Results

The Python UCB implementation has one method to get the next action and one to set the reward for an action. The method for getting the next action calculates a score for each action and returns the one with the highest score. There are also methods to checkpoint and restore the model. It additionally supports an algorithm called tuned UCB, which takes the variance of the reward into account.
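For reference, the published UCB1-Tuned variant replaces the exploration constant with a variance based bound; whether the tuned UCB in the repository follows this exact form is not shown here:

V_i(n_i) = \left(\frac{1}{n_i} \sum_{t=1}^{n_i} r_{i,t}^2\right) - \bar{r}_i^2 + \sqrt{\frac{2 \ln n}{n_i}}

s_i = \bar{r}_i + \sqrt{\frac{\ln n}{n_i} \min\left(\frac{1}{4},\, V_i(n_i)\right)}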

The driver code has a command line loop with three menu choices: schedule, process and exit. It can serve as an example for your particular use case. The schedule and process operations should be executed alternately. Here is some output for schedule. The scheduled flag indicates whether a worker responded positively when asked to report to work. You can also browse the log file, which has a lot of useful information on the run time behavior of the model.

worker A3G24LR26L  scheduled True
worker 9QHC8TU48K  scheduled False
worker 18G3ZIUCMM  scheduled True
worker J70AUORYM8  scheduled True
worker E2R1IODHJZ  scheduled True
worker SJCWBAF3XQ  scheduled True
worker FGUV2UN5F4  scheduled False

Here is some output from processing the day's work. For each worker there is a calculated score, which is reported to the model as the reward for selecting that worker.

worker A3G24LR26L  score 0.660
worker 9QHC8TU48K  score 0.160
worker 18G3ZIUCMM  score 0.704
worker J70AUORYM8  score 0.745
worker E2R1IODHJZ  score 0.629
worker SJCWBAF3XQ  score 0.830
worker FGUV2UN5F4  score 0.160

Since the model runs continuously it can handle any non stationarity in the reward distribution. Non stationarity is simulated by shifting various reward related distributions at random times. Please refer to the tutorial for details on how to run the use case.
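For intuition, a non stationary reward source can be simulated roughly as below; the distribution, shift probability and shift magnitude are made up for illustration and are not taken from the tutorial.

import random

def reward_stream(mean=0.6, sd=0.1, shift_prob=0.02):
    # yields rewards from a Gaussian whose mean jumps at random times
    while True:
        if random.random() < shift_prob:
            mean = min(1.0, max(0.0, mean + random.uniform(-0.2, 0.2)))  # random shift
        yield min(1.0, max(0.0, random.gauss(mean, sd)))

stream = reward_stream()
first_ten = [round(next(stream), 3) for _ in range(10)]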

If you examine the score components in the log file, you will find that initially the second component, which corresponds to exploration, is higher than the first. As you go through many cycles, the first component, corresponding to exploitation, starts dominating, indicating that the model has transitioned into an exploitation mode, reaping the benefits of the actions that have been found to be more rewarding.

Wrapping Up

We have seen that Multi Arm Bandit models can be used to solve real life business problems that involve decision making under uncertain conditions. Classic MAB solutions are useful for many such business decision making problems. If your problem involves a state, plain MAB solutions are not appropriate; you can use contextual MAB or other RL algorithms. For complex use cases, Deep Reinforcement Learning based on neural models is the way to go.

