
An Intuitive Introduction to Reinforcement Learning

Source: https://towardsdatascience.com/an-intuitive-introduction-to-reinforcement-learning-ef8f004da55c?gi=b08b8a0077f2

Welcome to the Future of Artificial Intelligence

Jul 15 · 5 min read


Photo by Franck V. on Unsplash
Reinforcement Learning is the type of learning that is closest to the way humans learn.

Reinforcement Learning, as opposed to supervised and unsupervised learning, is a goal-oriented learning technique. It is based on an agent operating in an environment: at each step the agent takes a decision (an action) from a set of possible decisions and tries to maximize the profit (the reward) obtained by making that decision, iteratively learning which decisions lead to the desired goal (basically trial and error). I'll explain this in greater detail as we proceed through the article.
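
To make this vocabulary concrete, here is a minimal sketch of the agent-environment loop in Python. The `Environment` and `Agent` classes, their methods, and the reward scheme are all invented for illustration; they are not the API of any particular RL library.

```python
import random

class Environment:
    """Toy environment: action 1 is the 'correct' decision."""
    def step(self, action):
        # Reward of 1 for the desired action, 0 otherwise.
        return 1 if action == 1 else 0

class Agent:
    """Agent that learns by trial and error which action pays off."""
    def __init__(self, actions):
        self.actions = actions
        self.current_choice = random.choice(actions)

    def act(self):
        return self.current_choice

    def learn(self, reward):
        # Keep a rewarded action; otherwise try another one next time.
        if reward == 0:
            self.current_choice = random.choice(self.actions)

env = Environment()
agent = Agent(actions=[0, 1])
for episode in range(20):
    reward = env.step(agent.act())
    agent.learn(reward)

print("Learned action:", agent.act())  # settles on action 1
```

Real RL algorithms replace this naive `learn` rule with principled update rules, but the loop structure (observe, act, receive a reward, update) stays the same.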

In this article, I’ll be discussing the fundamentals of Reinforcement Learning or RL (with examples wherever possible).

Supervised? Unsupervised? Reinforcement Learning!

First things first! Before we even start talking about RL, let's see how exactly it differs from supervised and unsupervised learning techniques.

Let's consider the example of a kid who is learning to ride a bicycle. We'll see how this problem would be addressed if the kid were to learn in a supervised, unsupervised, or reinforcement learning way:

  1. Supervised Learning: If the kid starts calculating the force he needs to apply to the pedal, or the angle he needs to maintain with the ground to stay balanced, and optimizes these calculations at every instance of riding to perfect his skills, then he is learning in a supervised way.
  2. Unsupervised Learning: Whereas, if the kid watches thousands of other people riding bicycles and, based on that knowledge alone, figures out what exactly is to be done to ride a bicycle, then he has learned in an unsupervised way.
  3. Reinforcement Learning: Finally, if he's given a few options, such as hitting the pedal, turning the handlebars left or right, or applying the brakes, and the freedom to try whatever he wants among these options until he can ride the bicycle successfully, he'll first do it wrong and fail (maybe fall off); but eventually, after a few failed attempts, he'll figure out how to do it and succeed. This case is an example of reinforcement learning.

Well, now you know why it is said to be the closest to the way humans learn! Now you can expect the topics to get a little more formal as we proceed further.

Exploration vs. Exploitation

Let's continue with the example of the kid, who knows a set of actions he can perform to ride a bicycle. Consider a scenario where he has finally figured out that pedaling continuously drives the bicycle forward. However, he doesn't realize that at some point he has to stop, i.e. that applying the brakes at the right time is an integral part of riding a bicycle. Still, he's happy that he now knows how to move the bicycle and doesn't care about future events. Let's call his happiness a 'reward', meaning that he is rewarded for his action of pedaling. And since he's being rewarded, he keeps purely 'exploiting' the current action, pedaling, not knowing that in the end he might crash somewhere, which would leave him far from achieving his ultimate goal: riding the bicycle correctly.

Instead of just pedaling, he can 'explore' other options from the set of available actions. Eventually, he'll be able to stop the bicycle whenever he wants to; in a similar fashion, he'll learn how to take a turn, and in this way he'd become a good rider.

But too much of anything is bad! We saw that too much exploitation can lead to failure, and in the same way, too much exploration is also bad. If he just randomly changes his actions at every instance, he'll be nowhere near riding the bike, will he? So it's a trade-off, known as the Exploration-Exploitation Dilemma, and it is one of the major considerations when solving an RL problem.

Note that the kid decides his action at a given instance on the basis of his current 'state' with respect to the environment, i.e. his current motion/position while cycling, and the rewards obtained from previous tries. (This decision-making mechanism is what RL is all about.)
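
A standard way to handle this trade-off, which this article does not develop further, is an ε-greedy rule: with a small probability ε the agent explores a random action, and otherwise it exploits the action with the highest reward estimate so far. The sketch below applies it to a two-action version of the bicycle example; the action names and reward distributions are invented for illustration.

```python
import random

actions = ["pedal", "brake"]
estimates = {a: 0.0 for a in actions}  # running estimate of each action's reward
counts = {a: 0 for a in actions}
epsilon = 0.1  # fraction of the time spent exploring

def true_reward(action):
    # Hypothetical environment: pedaling pays off more on average.
    return random.gauss(1.0 if action == "pedal" else 0.5, 0.1)

for step in range(1000):
    if random.random() < epsilon:
        action = random.choice(actions)             # explore
    else:
        action = max(estimates, key=estimates.get)  # exploit
    reward = true_reward(action)
    counts[action] += 1
    # Incremental average: nudge the estimate toward the new sample.
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # the 'pedal' estimate should approach 1.0
```

With ε = 0 the agent would lock onto whichever action looked best first (pure exploitation); with ε = 1 it would act at random forever (pure exploration). Tuning ε, or gradually decreasing it, is how this dilemma is managed in practice.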

Building Blocks of an RL Problem

  1. Policy: A policy defines the behavior of an RL agent. In our example, the policy is the way the kid decides which action to choose among the available ones (the kid is the agent).
  2. Reward: Rewards define the goal of a problem. At each step, the environment sends a reward to the agent. In our example, the pleasure of riding the bicycle, or the pain of falling off, is the reward (the second case could be referred to as a penalty).
  3. Value Function: The reward is the environment's immediate response to the agent; however, we are interested in maximizing the reward in the long run, and this is what value functions compute. Formally, the value of a state is the total reward an agent can expect to accumulate over the future, starting from that state (Sutton & Barto). If the kid thinks through what could happen over, say, the next few hundred meters if he selects a particular action, that could be called the value (a small sketch after this list makes this concrete).
  4. Model: A model of the environment is a tool for planning. It mimics the actual environment and hence can be used to make inferences about how the environment would behave; for example, given a state and an action, the model might predict the resultant next state and next reward (Sutton & Barto). Accordingly, RL methods can be classified into model-based and model-free methods.
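
To make the value function concrete: the total future reward in the Sutton & Barto definition is usually computed with a discount factor gamma that weights later rewards less, an ingredient this article does not introduce explicitly. The reward sequence below is invented for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Value of a state as the discounted sum of the rewards that follow it."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Rewards the kid might collect over the next few moments of riding:
# three pleasant steps of pedaling, then a painful fall.
rewards = [1.0, 1.0, 1.0, -5.0]
print(discounted_return(rewards))  # 1 + 0.9 + 0.81 - 3.645 = -0.935
```

Even though each pedaling step is immediately rewarding, the state has negative value because of the fall at the end; this is exactly why the kid should care about more than the immediate reward.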

Conclusion

We've built an intuition of what an RL problem looks like and how one can address it, and we distinguished RL from supervised and unsupervised learning. Reinforcement Learning is far more intricate than the outline laid out in this article, but this is enough to cover the fundamental concepts.

References

Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed.): https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf

NPTEL RL Course: https://www.youtube.com/watch?v=YaPSPu7K9S0&list=PLyqSpQzTE6M_FwzHFAyf4LSkz_IjMyjD9&index=5

