Aravind Balakrishnan, Master’s candidate
David R. Cheriton School of Computer Science
The behaviour planning subsystem, which is responsible for high-level decision making and planning, is an important component of an autonomous driving system. A learned behaviour planner can offer advantages over traditional rule-based approaches. However, high-quality labelled data for training behaviour planning models is hard to acquire. Thus, reinforcement learning (RL), which can learn a policy from simulation, is a viable option for this problem. However, modelling inaccuracies between the simulator and the target environment, called the ‘transfer gap,’ hinder deployment in a real autonomous vehicle. High-fidelity simulators, which have a smaller transfer gap, come with large computational costs that are unfavourable for RL training. Therefore, we often have to settle for a fast but lower-fidelity simulator, which exacerbates the transfer learning problem.
In this thesis, we study how a low-fidelity 2D simulator can be used in place of a slower 3D simulator for training RL behaviour planning models, and we analyze the resulting policies in comparison with a rule-based approach. We develop WiseMove, an RL framework for autonomous driving research that supports hierarchical RL, to serve as the low-fidelity source simulator. We then set up a transfer learning scenario from WiseMove to an Unreal-based simulator for the Autonomoose system in order to study and close the transfer gap.
We find that perception errors in the target simulator contribute the most to the transfer gap. These errors, when naively modelled in WiseMove, yield a policy that performs better in the target simulator than a carefully constructed rule-based policy. Applying domain randomization to the environment yields an even better policy. The final RL policy reduces the failures due to perception errors from 10% to 2%. We also observe that the RL policy relies less on velocity, which is unreliable in the target simulator, than the rule-based algorithm does. To understand the exact learned behaviour, we also distill the RL policy using a decision tree to obtain an interpretable rule-based policy. We show that manually constructing a rule-based policy that handles perception errors well is not trivial. Future work can explore more driving scenarios under fewer constraints to further validate this result.
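The distillation step above can be sketched in a few lines: roll out the trained policy to collect state–action pairs, then fit a shallow decision tree that imitates it, whose splits read as interpretable if-then driving rules. This is a minimal illustration, not the thesis implementation; the stand-in policy, its two-feature state (ego speed, gap to lead vehicle), and the braking rule are all hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical stand-in for the trained RL policy: maps a state
# (ego speed in m/s, gap to lead vehicle in m) to a discrete
# manoeuvre (0 = maintain, 1 = brake).
def rl_policy(state):
    speed, gap = state
    return 1 if gap < 2.0 * speed else 0  # brake when time headway < 2 s

rng = np.random.default_rng(0)

# 1. Roll out the policy (here: sample states) and record its decisions.
states = rng.uniform(low=[0.0, 0.0], high=[30.0, 100.0], size=(5000, 2))
actions = np.array([rl_policy(s) for s in states])

# 2. Fit a shallow decision tree to imitate the recorded behaviour.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(states, actions)

# 3. The tree's splits form an interpretable rule-based policy.
print(export_text(tree, feature_names=["speed", "gap"]))
print("fidelity:", tree.score(states, actions))  # agreement with the RL policy
```

A deliberately small `max_depth` keeps the extracted rules human-readable, at the cost of some fidelity to the original policy.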