
7 Things to Think About When Developing Reinforcement Learners

source link: https://towardsdatascience.com/7-things-to-think-about-when-developing-reinforcement-learners-71e6fa5434d9

Although we have made good progress in reinforcement learning research, a unified framework for comparing algorithms is still missing, and the metrics reported in research papers do not give enough information. Here, we discuss what to look at to make the analysis more rigorous.

We all know how reinforcement learning papers mostly work. Researcher A publishes algorithm B; algorithm B outperforms a subset of other “state-of-the-art” algorithms on a strategically chosen subset of environments that happen to work well for it. In addition, the authors may or may not tune the hyperparameters of the baselines, but do report the best runs for algorithm B.

Without going into detail about what has caused this research trend, we can do something to improve it. Proper evaluation metrics (in addition to proper benchmarks) are essential for a valid comparison. What researchers mostly report is the mean performance across runs; if you are lucky, they also report the median, which is somewhat more informative. Although this sounds a bit cynical about RL research, I have to say that RL is tame compared to what happens in other subfields of machine learning, such as not even having multiple runs of the algorithm (vision people, I am talking to you :) ).

Image source: https://uscresl.github.io/humanoid-gail/

Hence, this post is about what we have to look at when comparing one reinforcement learning algorithm to another. A great source of inspiration is [1], where the authors suggest concrete ways of calculating various reliability metrics for RL algorithms; here we will take a more top-level perspective, since the intricacies are just technical details of the greater goal.

Most intuitively, when developing an algorithm, you should look at how sensitive it is to various factors of the training procedure, such as the random seed and the hyperparameters. Less variability means the algorithm is more stable, robust and reliable. Beyond general variability, we also want to look at the worst case, i.e. the expected value of the metric in the lower tail of its distribution. No wonder the authors of [1] took inspiration from finance when defining concrete metrics, since it turns out that risk and variability matter in RL as well. All in all, the different “reliability” categories can be separated as follows:

Image source: http://www.iri.upc.edu/files/scidoc/2168-Learning-cloth-manipulation-with-demonstrations.pdf

1. Variability during Training within Rollouts

Ideally, we would like continuous, monotonic improvement: the average performance should increase with each rollout and within rollouts, and it shouldn’t get (significantly) worse from rollout to rollout. Unfortunately, this is rarely the case; RL algorithms tend to be unstable. The source of the variability can also be the environment, so you want to account for the stochasticity of the environment and adjust the metric for it. Ideally, you want to obtain the best performance at the end of the training run, not somewhere in the middle.
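A minimal sketch of how such within-run spread could be estimated, assuming the training curve is available as a NumPy array of per-rollout returns (the function name and the use of first differences for detrending are my own simplification, not the exact estimator from [1]):

```python
import numpy as np

def dispersion_within_run(perf, lower=25, upper=75):
    """Spread of a single training curve after removing its overall trend.

    perf: 1-D array of per-rollout mean returns from one training run.
    Detrending with first differences separates "the curve is still
    improving" from "the curve is jittering around its trend".
    """
    detrended = np.diff(perf)                       # remove the monotonic trend
    q_lo, q_hi = np.percentile(detrended, [lower, upper])
    return q_hi - q_lo                              # inter-quartile range of the jitter
```

A lower value means a smoother, more monotonic curve; comparing this number across algorithms (on the same environment and evaluation protocol) is more informative than eyeballing the curves.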

2. Variability across Different Training Runs

The initial conditions of training shouldn’t significantly influence the algorithm’s performance, which is why it is important to use different random seeds for different training runs (vision people, I am looking at you!). Sensitivity to hyperparameters should also be accounted for.
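As a rough illustration (again a simplification of [1], with hypothetical names), the spread across runs can be summarized by an inter-quartile range over end-of-training scores:

```python
import numpy as np

def dispersion_across_runs(final_scores, lower=25, upper=75):
    """Spread of end-of-training performance over seeds / hyperparameters.

    final_scores: 1-D array with one entry per training run, e.g. the mean
    return over the last few evaluations of each run.
    """
    q_lo, q_hi = np.percentile(final_scores, [lower, upper])
    return q_hi - q_lo

# e.g. five seeds of the same algorithm and hyperparameter setting (made-up numbers):
# dispersion_across_runs(np.array([212.0, 198.5, 240.1, 95.3, 230.7]))
```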

3. Variability across Rollouts in Evaluation

We would like the algorithm to produce similar performance and behavior across evaluation rollouts. This shows how well the trained policy deals with the stochasticity of the environment and with different initial conditions. One must also take into account that the maximum achievable performance within a rollout can depend on the initial state.
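A corresponding sketch for evaluation, assuming you have collected one (possibly normalized) return per evaluation episode of a fixed, trained policy:

```python
import numpy as np

def dispersion_across_eval_rollouts(returns, lower=25, upper=75):
    """Spread of returns over many evaluation rollouts of a fixed policy.

    returns: 1-D array with one return per evaluation episode. If the
    achievable return depends strongly on the initial state, normalize
    each return by an estimate of what is achievable from that state first.
    """
    q_lo, q_hi = np.percentile(returns, [lower, upper])
    return q_hi - q_lo
```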

4. Short-term Risk within Training Rollouts

The algorithm should offer some guarantees on worst-case performance, which is especially important when there are safety considerations during training. In the short-term case, we want the performance not to drop too sharply from one rollout to the next. Looking at risk effectively means looking at the expected value of the lowest tail of the (local) distribution, below a certain percentile (say, 5%), which is what finance calls conditional value at risk (CVaR).
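A minimal sketch of such a short-term risk measure, using CVaR over one-step changes of the training curve (the names and the exact windowing are my own simplification of [1]):

```python
import numpy as np

def short_term_risk(perf, alpha=0.05):
    """Expected size of the worst local drops in a training curve (CVaR).

    perf: 1-D array of per-rollout mean returns from one training run.
    We take rollout-to-rollout changes and average those below the
    alpha-quantile, i.e. the expected value of the worst 5% of drops.
    """
    changes = np.diff(perf)                     # negative values = performance dropped
    cutoff = np.quantile(changes, alpha)        # threshold of the worst alpha fraction
    return changes[changes <= cutoff].mean()    # more negative = riskier locally
```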

5. Long-term Risk within Training Rollouts

Looking at the whole rollout, we want to close the gap between the worst and the best performance within it. Compared to the short-term case, here we fit the distribution over the whole rollout, so the metric captures the expected value of the worst performance that we rarely, but possibly, obtain within it. Again, this can come from the instability of the algorithm, but also from the characteristics of the environment.
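One way to sketch a long-term counterpart is CVaR over drawdown, i.e. how far the curve has fallen below its running maximum (again an assumption-laden simplification, not necessarily the exact definition from [1]):

```python
import numpy as np

def long_term_risk(perf, alpha=0.05):
    """Expected size of the worst drawdowns within a whole training run.

    perf: 1-D array of per-rollout mean returns from one training run.
    Drawdown = running maximum minus current performance (>= 0); we
    average the largest alpha fraction of those gaps.
    """
    drawdown = np.maximum.accumulate(perf) - perf
    cutoff = np.quantile(drawdown, 1.0 - alpha)         # the largest gaps
    return drawdown[drawdown >= cutoff].mean()          # larger = riskier long-term
```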

6. Risk across Training Runs

In contrast to point 2, where we look at the variability after discarding outliers, here we want to see what happens in the low-probability case that we get a really bad seed or a really bad set of hyperparameters.
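A corresponding sketch, computing CVaR over end-of-training scores across runs (with only a handful of runs this is effectively the mean of the worst one or two runs):

```python
import numpy as np

def risk_across_runs(final_scores, alpha=0.05):
    """Expected end-of-training performance of the worst runs (CVaR).

    final_scores: 1-D array with one end-of-training score per run, where
    runs differ in random seed and/or hyperparameters.
    """
    cutoff = np.quantile(final_scores, alpha)
    return final_scores[final_scores <= cutoff].mean()  # lower = riskier across runs
```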

7. Risk across Rollouts at Evaluation

In contrast to point 3, we look at the worst-case performance across many rollouts in evaluation. Again, the source of the variability can be the algorithm but also the environment.
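And the analogous sketch for evaluation, applying the same CVaR idea to per-rollout returns of a fixed, trained policy (hypothetical names, same caveats as above):

```python
import numpy as np

def risk_across_eval_rollouts(returns, alpha=0.05):
    """Expected return of the worst evaluation rollouts of a fixed policy (CVaR).

    returns: 1-D array with one return per evaluation episode.
    """
    cutoff = np.quantile(returns, alpha)
    return returns[returns <= cutoff].mean()
```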

