Reinforcement Learning for Robotic Control

When I began studying Reinforcement Learning (RL) in robotics, I found that there was no literature reviewing the range of RL methods applied to complex, multistep tasks. To address this gap, and to provide a resource for future researchers working on robotic control with RL, I wrote a review paper focused on the multi-action sequence of pick-and-place. The task involves object recognition, grasp selection, manipulation, and placement, a series of actions relevant to most robotic applications. The survey collects the results published in this area to provide a resource that can be used for benchmarking RL control strategies in the future. I was the primary author on this paper and wrote the content myself, with review support from my supervisor and a senior member of my lab. The paper was published in MDPI Robotics and can be found at this link.

Pick and Place

Since the publication of the paper, I have focused my research on establishing the RL for Robotics project for the AI for Manufacturing Lab at the University of Waterloo. The goal of this project is to design, simulate, and validate new RL control strategies for simple tasks such as robotic grasping and robotic pick-and-place.

The foundation of the RL control strategy is the Markov Decision Process (MDP). The MDP formalizes the interaction between an agent and its environment: the agent receives rewards or penalties based on its actions, which encourages specific behaviours and discourages others. The agent selects each action according to its policy, a function that maps states to actions. For robotic pick-and-place, the “state” fed to the robotic agent is the rendered RGB-D image of the 6-DoF robotic arm in the environment. The “action” is the x, y, z, pitch, roll, yaw, and gripper position command that is sent to the robotic arm for execution. The diagram below shows the cycle of the MDP for robotic grasping.

MDP Learning Framework
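To make the MDP cycle concrete, here is a minimal sketch of the agent-environment interaction loop in the Gym style, assuming an environment that returns an RGB-D observation and accepts the seven-dimensional end-effector command described above. The environment object, observation format, and policy are illustrative placeholders, not the project's actual code.

```python
import numpy as np

def run_episode(env, policy, max_steps=200):
    """Minimal MDP interaction loop (Gym-style API), for illustration only."""
    state = env.reset()            # RGB-D image of the arm and workspace
    total_reward = 0.0
    for _ in range(max_steps):
        # Action: [x, y, z, roll, pitch, yaw, gripper] end-effector command
        action = policy(state)
        next_state, reward, done, info = env.step(action)
        total_reward += reward     # reward encourages grasping/placing behaviour
        state = next_state
        if done:
            break
    return total_reward

# Placeholder policy: random 7-dimensional command in [-1, 1]
random_policy = lambda state: np.random.uniform(-1.0, 1.0, size=7)
```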

Exploring the environment, building a value function representation, and updating the control policy requires a complex methodology. The model-free methods that I developed and tested for this task were Twin Delayed Deep Deterministic Policy Gradient (TD3) and Proximal Policy Optimization (PPO).
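As a hedged illustration of how such model-free agents can be set up, the snippet below instantiates the stable-baselines3 implementations of TD3 and PPO with an image-based policy. The library choice and the environment ID are assumptions made for the sketch, not necessarily what was used in the project.

```python
import gym
from stable_baselines3 import TD3, PPO

# "RoboticPickPlace-v0" is a placeholder ID for a custom image-based environment.
env = gym.make("RoboticPickPlace-v0")

# Off-policy actor-critic with twin critics and delayed policy updates.
td3_agent = TD3("CnnPolicy", env, learning_rate=1e-4, buffer_size=100_000, verbose=1)

# On-policy clipped-surrogate policy gradient.
ppo_agent = PPO("CnnPolicy", env, learning_rate=3e-4, n_steps=2048, verbose=1)

td3_agent.learn(total_timesteps=1_000_000)
ppo_agent.learn(total_timesteps=1_000_000)
```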

The robotic agent learned the control policies through interaction with a custom OpenAI Gym environment built on the PyBullet physics simulator. To simplify the task so the RL agent did not need to learn inverse kinematics, the end effector was controlled using the Samuel Buss inverse kinematics library.
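The sketch below shows one way an end-effector pose command can be converted into joint targets using PyBullet's built-in inverse kinematics call (which wraps the Samuel Buss solver). The link index and joint mapping are placeholders for the actual arm model, not the project's implementation.

```python
import pybullet as p

END_EFFECTOR_LINK = 11  # placeholder link index for the gripper

def apply_action(robot_id, action):
    """Convert an [x, y, z, roll, pitch, yaw, gripper] command into joint targets."""
    target_pos = action[:3]
    target_orn = p.getQuaternionFromEuler(action[3:6])

    # Solve IK for the commanded end-effector pose instead of learning joint-space control.
    joint_targets = p.calculateInverseKinematics(
        robot_id, END_EFFECTOR_LINK, target_pos, target_orn
    )

    # Drive the arm joints toward the IK solution with position control.
    # (Assumes all joints are movable; a real arm may need a joint-index mapping.)
    for joint_index, joint_target in enumerate(joint_targets):
        p.setJointMotorControl2(
            robot_id, joint_index, p.POSITION_CONTROL, targetPosition=joint_target
        )
    # The gripper command (action[6]) would be applied to the finger joints here.
```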

For this form of RL, at least two neural networks are required. As shown in the figure below, a convolutional neural network (CNN) takes an RGB-D image from the simulated environment and extracts a compressed vector of positional information. This vector is fed into a feed-forward neural network that outputs the action command expected to maximize the reward. For this project, ResNet, DenseNet, and custom CNN structures were tested. Due to limited GPU availability and for sample efficiency, a shallow three-layer CNN was used in the final model.

CNN Training
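A minimal PyTorch sketch of this two-network structure is shown below: a shallow three-layer CNN compresses the four-channel RGB-D image into a feature vector, and a feed-forward head maps that vector to the seven-dimensional action. The layer sizes and image resolution are illustrative assumptions, not the final architecture.

```python
import torch
import torch.nn as nn

class PickPlacePolicy(nn.Module):
    """Shallow 3-layer CNN feature extractor + feed-forward action head (sketch)."""

    def __init__(self, action_dim=7):
        super().__init__()
        # Three convolutional layers compress the 4-channel RGB-D image.
        self.cnn = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened feature size from a dummy 84x84 observation.
        with torch.no_grad():
            n_features = self.cnn(torch.zeros(1, 4, 84, 84)).shape[1]
        # Feed-forward head maps the compressed features to the action command.
        self.head = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # actions scaled to [-1, 1]
        )

    def forward(self, rgbd_image):
        return self.head(self.cnn(rgbd_image))

policy = PickPlacePolicy()
action = policy(torch.rand(1, 4, 84, 84))  # batch of one 84x84 RGB-D observation
```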

All machine learning methods rely on a set of hyperparameters such as the number of epochs, the batch size, the choice of activation function, and the number of hidden layers. Due to the complex nature of RL for robotic control, hyperparameter tuning determines both the stability of training and whether the model converges. To search the hyperparameter space efficiently, I used an open-source implementation of the Tree-structured Parzen Estimator (TPE), a Bayesian optimization method.
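As a sketch of what such a search might look like, the snippet below uses Optuna, an open-source library whose default sampler is a TPE. The specific hyperparameters, ranges, and training stub are assumptions for illustration rather than the study's actual configuration.

```python
import optuna

def train_agent(learning_rate, batch_size, gamma, n_hidden_layers):
    # Stand-in for the real RL training run; returns the reward to maximize.
    return 0.0

def objective(trial):
    # Hyperparameter ranges are illustrative, not the ranges used in the study.
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [64, 128, 256])
    gamma = trial.suggest_float("gamma", 0.9, 0.9999)
    n_hidden_layers = trial.suggest_int("n_hidden_layers", 1, 3)
    return train_agent(learning_rate, batch_size, gamma, n_hidden_layers)

study = optuna.create_study(direction="maximize")  # TPE sampler by default
study.optimize(objective, n_trials=200)

# Parallel coordinate plot of the tested hyperparameter combinations.
optuna.visualization.plot_parallel_coordinate(study)
```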

The parallel coordinate plots below show the different hyperparameter combinations which were tested during two different studies. Each line starts on the left of the plot with the reward achieved at the end of an exploratory episode. Darker lines indicate higher rewards for a particular combination of hyperparameters. The studies each explored 200 different hyperparameter combinations.

Tuning Hyperparameters 1: Tuning Hyperparameters 2:

The results of training “PandaReach” for 2 million training steps with the best hyperparameter combination found can be seen below. The results show a significant improvement in performance after 1.25 million training steps.

Optimization Plot Convergence
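As a hedged illustration of how such a trained agent can be checked after the 2-million-step run, the snippet below loads a saved stable-baselines3 model and evaluates it over a handful of episodes. The model path, environment ID, and use of the panda-gym package are assumptions for the sketch.

```python
import gym
import panda_gym  # assumed: registers the PandaReach environments on import
from stable_baselines3 import TD3
from stable_baselines3.common.evaluation import evaluate_policy

# Placeholder environment ID and model path.
env = gym.make("PandaReach-v2")
model = TD3.load("td3_panda_reach_2M_steps", env=env)

# Average return over 20 evaluation episodes.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
```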

The renders of PandaReach after 200 thousand and 2 million steps can be seen below. After 200 thousand steps, the agent exhibits significant random motion (GIF on the left), but after 2 million steps, the agent smoothly approaches the block (GIF on the right).

Panda Reach 500K Steps
Panda Reach 500k

Further discussion of how the real-world robotic arm shown in the video below is controlled can be found on this page: Real-World Robotic Control.

Panda


sequence_12.mp4 (24.14 MB)