Reinforcement Learning

The Future of Artificial Intelligence and Learning

Terminology 

Before getting into the “what is” questions, we first need to clear up some terminology that will be used throughout this newsletter [1]; a short code sketch after the list shows how these pieces fit together.

  • The agent refers to the learner or decision-maker that takes the actions.
  • The environment refers to the rules and mechanics the agent must follow to reach a goal.
  • The state refers to the situation the agent faces at the current moment.
  • The reward refers to the feedback the agent gets on whether it is succeeding or failing at its goal.
  • The policy refers to the strategy the agent uses to choose its actions and reach its goal.
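
To make these terms concrete, here is a minimal Python sketch of the loop they describe (my own toy example, not taken from the reference): a one-dimensional grid world in which the agent walks toward a goal square. Each comment ties a line back to a term from the list above.

    import random

    class GridWorld:                          # the environment: the rules the agent must follow
        def __init__(self, size=4):
            self.size = size
            self.position = 0                 # the state: where the agent is right now

        def step(self, action):               # action is -1 (step left) or +1 (step right)
            self.position = max(0, min(self.size - 1, self.position + action))
            reached_goal = self.position == self.size - 1
            reward = 1 if reached_goal else 0 # the reward: feedback on success or failure
            return self.position, reward, reached_goal

    def random_policy(state):                 # the policy: how the agent chooses its actions
        return random.choice([-1, +1])

    env = GridWorld()
    state, done = env.position, False
    while not done:                           # the agent: the learner driving this loop
        action = random_policy(state)
        state, reward, done = env.step(action)
    print("episode finished with reward", reward)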

     

What is Reinforcement Learning?

RL refers to a method for training deep neural networks to achieve goals, in some cases at superhuman levels of performance. Unlike other methods of training deep neural networks, RL starts the network off with no dataset to learn from. Instead, the agent gathers its own data from both successful and failed trials, and it eventually uses that experience to reach its goal.
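
As a toy illustration of that idea (an assumption of mine, not an example from the article), the snippet below gives an agent two unlabeled options and no data up front; it builds its own record of successes and failures, then acts on whatever that experience says worked best.

    import random

    payout = {"A": 0.2, "B": 0.8}             # hidden success rates the agent never sees directly
    experience = {"A": [], "B": []}           # data gathered purely from the agent's own trials

    for _ in range(200):                      # exploratory trials: successes and failures alike
        choice = random.choice(["A", "B"])
        success = random.random() < payout[choice]
        experience[choice].append(success)

    # The gathered experience, not a pre-labelled dataset, decides the final behaviour.
    best = max(experience, key=lambda k: sum(experience[k]) / max(len(experience[k]), 1))
    print("the agent settles on option", best)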

A popular approach to RL is called “policy gradients”, which uses a penalty-and-reward system [2]. When a machine is set off to do a specific task, it performs actions with no policy in mind. Steps that lead efficiently toward a successful outcome add one to a reward counter, while steps that lead to failure subtract one. The agent’s objective is to complete a trial with as many reward points as possible. The process relies on the agent occasionally stumbling onto an action that moves it toward its goal and incorporating that action into future trials. Over many trials, the information it gains slowly filters out actions that earned negative rewards, while actions that earned positive rewards become more likely.
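
The sketch below is a minimal policy-gradient example of that plus-one/minus-one scoring, written as my own illustration rather than the exact method from [2]: a softmax policy over two actions is nudged toward the action that earns +1 and away from the one that earns -1 (a REINFORCE-style update).

    import numpy as np

    rng = np.random.default_rng(0)
    theta = np.zeros(2)                       # the policy's parameters: one preference per action
    alpha = 0.1                               # learning rate

    def policy(theta):
        exps = np.exp(theta - theta.max())
        return exps / exps.sum()              # softmax: turns preferences into action probabilities

    for trial in range(500):
        probs = policy(theta)
        action = rng.choice(2, p=probs)       # sample an action from the current policy
        reward = +1 if action == 1 else -1    # in this toy, action 1 is the one that reaches the goal
        grad_log = -probs                     # gradient of log pi(action) for a softmax policy
        grad_log[action] += 1.0
        theta += alpha * reward * grad_log    # reinforce rewarded actions, suppress penalised ones

    print("final action probabilities:", policy(theta))

Over the 500 trials the probability of the rewarded action climbs toward 1, which is exactly the “filtering out” behaviour described above.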

[Figure: flow chart of how reinforcement learning works, showing the loop between an agent and its environment]

Comparison with Human Learning 

When people are tasked with learning something new, past knowledge and intuition help greatly in understanding how to solve the task. A machine using policy gradients has no such head start; its only way of making progress is to brute-force a solution through trial and error. Depending on the complexity of the task, an agent may go through millions of trials and still be less competent than a toddler. Because of the nature of how RL works, it is incredibly inefficient at solving problems with a wide range of choices to consider. This is not to say that RL is simply worse than human learning, but it has flaws that keep humanity from going completely autonomous in the things we do.

Example of Reinforcement Learning

Applications that incorporate RL have already toppled humanity in some respects. One of the most well-known examples is the AI known as AlphaGo [3]. AlphaGo is a computer program developed by DeepMind, a subsidiary of Alphabet Inc., built to become an expert at the board game “Go”, hence its name. It went on to defeat the professional player Mr. Fan Hui, then the reigning European Go champion, with a perfect 5-0 victory. It continued its path of success by defeating top international Go players online, achieving a streak of 60 consecutive wins.

The applications of RL are not limited to games. Engineers at Wayve have used it to develop autonomous cars [4]. In under 20 minutes of on-road trials, their computer learned to follow lanes almost perfectly, working out what to do on the road from data collected by cameras on the vehicle. A video demonstration of RL at work is available on Wayve’s blog [4].

Conclusion

By harnessing the basic principle of learning by trial and error, RL has truly proven its technological worth. It has opened new doors for the Go community in how the 3,000-year-old game can be played, and for the automobile world through Wayve’s autonomous driving. The bounds of RL are defined only by the ingenuity of scientists and engineers.

References

[1] Skymind AI. (n.d.) A Beginner's Guide to Deep Reinforcement Learning. Retrieved from https://skymind.ai/wiki/deep-reinforcement-learning

[2] Karpathy, A. (2016) Deep Reinforcement Learning: Pong from Pixels. Retrieved from http://karpathy.github.io/2016/05/31/rl/

[3] DeepMind. (n.d.) AlphaGo. Retrieved from https://deepmind.com/research/alphago/

[4] Wayve. (2018) Learning to Drive in a Day. Retrieved from https://wayve.ai/blog/learning-to-drive-in-a-day-with-reinforcement-learning