
7 Things to Think About When Developing Reinforcement Learners

source link: https://towardsdatascience.com/7-things-to-think-about-when-developing-reinforcement-learners-71e6fa5434d9

Although we have made good progress in reinforcement learning research, a unified framework for comparing algorithms is still missing, and the metrics reported in research papers do not give enough information. Here, we discuss what to look at to make the analysis more rigorous.

We all know how reinforcement learning papers mostly work. Researcher A publishes algorithm B; algorithm B outperforms a subset of other “state-of-the-art” algorithms on a strategically chosen subset of environments that happen to work well for it. In addition, the authors may or may not tune the hyperparameters of the baselines, but do report the best runs for algorithm B.

Without going into detail about what has caused this research trend, we can do something to improve it. Proper evaluation metrics (in addition to proper benchmarks) are essential for a valid comparison. What researchers mostly report is the mean performance across runs; if you are lucky, they also report the median, which is somewhat more informative. Although this sounds a bit cynical about RL research, I have to say that RL is tame compared to what happens in other subfields of machine learning, such as not even having multiple runs of the algorithm (vision people, I am talking to you :) ).

Image source: https://uscresl.github.io/humanoid-gail/

Hence, this post is about what we have to look at when comparing one reinforcement learning algorithm to another. A great source of inspiration is [1], where the authors suggest concrete ways of calculating various reliability metrics for RL algorithms; here we will take a more top-level perspective, since the intricacies are just technical details of the greater goal.

Most intuitively, when developing an algorithm, you should look at how sensitive it is to various factors of the training procedure, such as the random seed and the hyperparameters. Less variability means the algorithm is more stable, robust and reliable. Beyond general variability, we also want to look at the worst case, i.e. the expected value of the metric in the lower tail of its distribution. No wonder the authors of [1] took inspiration from finance when defining concrete metrics, since it turns out that risk and variability matter in RL as well. All in all, the different “reliability” categories can be separated as follows:

Image source: http://www.iri.upc.edu/files/scidoc/2168-Learning-cloth-manipulation-with-demonstrations.pdf

1. Variability during Training within Rollouts

Ideally, we would like continuous, monotonic improvement: the average performance should increase with each rollout and within rollouts, and it shouldn’t get (significantly) worse from rollout to rollout. Unfortunately, this is rarely the case; RL algorithms tend to be unstable. The source of the variability can also be the environment, so you want to account for the stochasticity of the environment and adjust the metric for it. Ideally, you want to obtain the best performance at the end of the training run, not somewhere in the middle.
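A minimal sketch of how such within-run spread could be estimated, assuming the training curve is available as a NumPy array of per-rollout returns (the function name and the use of first differences for detrending are my own simplification, not the exact estimator from [1]):

```python
import numpy as np

def dispersion_within_run(perf, lower=25, upper=75):
    """Spread of a single training curve after removing its overall trend.

    perf: 1-D array of per-rollout mean returns from one training run.
    Detrending with first differences separates "the curve is still
    improving" from "the curve is jittering around its trend".
    """
    detrended = np.diff(perf)                       # remove the monotonic trend
    q_lo, q_hi = np.percentile(detrended, [lower, upper])
    return q_hi - q_lo                              # inter-quartile range of the jitter
```

A lower value means a smoother, more monotonic curve; comparing this number across algorithms (on the same environment and evaluation protocol) is more informative than eyeballing the curves.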

2. Variability across Different Training Runs

The initial conditions of training shouldn’t significantly influence the algorithm’s performance, which is why it is important to use different random seeds for different training runs (vision people, I am looking at you!). Sensitivity to hyperparameters should also be accounted for.
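As a rough illustration (again a simplification of [1], with hypothetical names), the spread across runs can be summarized by an inter-quartile range over end-of-training scores:

```python
import numpy as np

def dispersion_across_runs(final_scores, lower=25, upper=75):
    """Spread of end-of-training performance over seeds / hyperparameters.

    final_scores: 1-D array with one entry per training run, e.g. the mean
    return over the last few evaluations of each run.
    """
    q_lo, q_hi = np.percentile(final_scores, [lower, upper])
    return q_hi - q_lo

# e.g. five seeds of the same algorithm and hyperparameter setting (made-up numbers):
# dispersion_across_runs(np.array([212.0, 198.5, 240.1, 95.3, 230.7]))
```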

3. Variability across Rollouts in Evaluation

We would like the algorithm to produce similar performance and behavior across evaluation rollouts. This shows how well the trained policy deals with the stochasticity of the environment and with different initial conditions. One must also take into account that the maximum achievable performance within a rollout can depend on the initial state.
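A corresponding sketch for evaluation, assuming you have collected one (possibly normalized) return per evaluation episode of a fixed, trained policy:

```python
import numpy as np

def dispersion_across_eval_rollouts(returns, lower=25, upper=75):
    """Spread of returns over many evaluation rollouts of a fixed policy.

    returns: 1-D array with one return per evaluation episode. If the
    achievable return depends strongly on the initial state, normalize
    each return by an estimate of what is achievable from that state first.
    """
    q_lo, q_hi = np.percentile(returns, [lower, upper])
    return q_hi - q_lo
```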

4. Short-term Risk within Training Rollouts

The algorithm should offer some guarantees on worst-case performance, which is especially important when there are safety considerations during training. In the short-term case, we want the performance not to drop too sharply from one rollout to the next. Looking at risk effectively means looking at the expected value of the lowest tail of the (local) distribution, below a certain percentile (say, 5%), which is what finance calls conditional value at risk (CVaR).
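A minimal sketch of such a short-term risk measure, using CVaR over one-step changes of the training curve (the names and the exact windowing are my own simplification of [1]):

```python
import numpy as np

def short_term_risk(perf, alpha=0.05):
    """Expected size of the worst local drops in a training curve (CVaR).

    perf: 1-D array of per-rollout mean returns from one training run.
    We take rollout-to-rollout changes and average those below the
    alpha-quantile, i.e. the expected value of the worst 5% of drops.
    """
    changes = np.diff(perf)                     # negative values = performance dropped
    cutoff = np.quantile(changes, alpha)        # threshold of the worst alpha fraction
    return changes[changes <= cutoff].mean()    # more negative = riskier locally
```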

5. Long-term Risk within Training Rollouts

Looking at the whole rollout, we want to close the gap between the worst and the best performance within it. Compared to the short-term case, here we fit the distribution over the whole rollout, so the metric captures the expected value of the worst performance that we rarely, but possibly, obtain within it. Again, this can come from the instability of the algorithm, but also from the characteristics of the environment.
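One way to sketch a long-term counterpart is CVaR over drawdown, i.e. how far the curve has fallen below its running maximum (again an assumption-laden simplification, not necessarily the exact definition from [1]):

```python
import numpy as np

def long_term_risk(perf, alpha=0.05):
    """Expected size of the worst drawdowns within a whole training run.

    perf: 1-D array of per-rollout mean returns from one training run.
    Drawdown = running maximum minus current performance (>= 0); we
    average the largest alpha fraction of those gaps.
    """
    drawdown = np.maximum.accumulate(perf) - perf
    cutoff = np.quantile(drawdown, 1.0 - alpha)         # the largest gaps
    return drawdown[drawdown >= cutoff].mean()          # larger = riskier long-term
```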

6. Risk across Training Runs

In contrast to point 2, where we look at the variability after discarding outliers, here we want to see what happens in the low-probability case that we get a really bad seed or a really bad set of hyperparameters.
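A corresponding sketch, computing CVaR over end-of-training scores across runs (with only a handful of runs this is effectively the mean of the worst one or two runs):

```python
import numpy as np

def risk_across_runs(final_scores, alpha=0.05):
    """Expected end-of-training performance of the worst runs (CVaR).

    final_scores: 1-D array with one end-of-training score per run, where
    runs differ in random seed and/or hyperparameters.
    """
    cutoff = np.quantile(final_scores, alpha)
    return final_scores[final_scores <= cutoff].mean()  # lower = riskier across runs
```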

7. Risk across Rollouts at Evaluation

In contrast to point 3, we look at the worst-case performance across many rollouts in evaluation. Again, the source of the variability can be the algorithm but also the environment.
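And the analogous sketch for evaluation, applying the same CVaR idea to per-rollout returns of a fixed, trained policy (hypothetical names, same caveats as above):

```python
import numpy as np

def risk_across_eval_rollouts(returns, alpha=0.05):
    """Expected return of the worst evaluation rollouts of a fixed policy (CVaR).

    returns: 1-D array with one return per evaluation episode.
    """
    cutoff = np.quantile(returns, alpha)
    return returns[returns <= cutoff].mean()
```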

