Ph.D. Defence Notice: "Learning to Engage: An Application of Deep Reinforcement Learning in Living Architecture Systems" by Lingheng Meng

Wednesday, May 17, 2023, 9:00 am EDT (GMT -04:00)

Candidate: Lingheng Meng

Title: Learning to Engage: An Application of Deep Reinforcement Learning in Living Architecture Systems

Date: May 17, 2023

Time: 9:00 AM

Place: REMOTE ATTENDANCE

Supervisor(s): Kulic, Dana (Adjunct); Gorbet, Rob

Abstract:

Physical agents that can autonomously generate engaging, life-like behavior will lead to more responsive and interesting robots and other autonomous systems. Although many advances have been made for one-to-one interactions in well-controlled settings, future physical agents should be capable of interacting with humans in natural settings, including group interaction. In order to generate engaging behaviors, an autonomous system must first be able to estimate its human partners' engagement level and then take actions that maximize the estimated engagement. In this thesis, we take Living Architecture Systems (LAS), architecture-scale interactive systems capable of group interaction through distributed embedded sensors and actuators, as a testbed and apply Deep Reinforcement Learning (DRL), treating the estimate of engagement as the reward signal in order to automatically generate engaging behavior. Applying DRL to LAS is difficult because of DRL's low data efficiency, its overestimation problem, and issues with state observability, especially given the large observation and action spaces of LAS.

We first propose an approach for estimating engagement during group interaction that simultaneously takes into account active and passive interaction, and we use this measure as the reward signal within a reinforcement learning framework to learn engaging interactive behaviors. The proposed approach is implemented in a LAS in a museum setting. We compare the learning system to a baseline that uses pre-scripted interactive behaviors. Analysis of sensory and survey data shows that adaptable behaviors within an expert-designed action space can achieve higher engagement and likeability. However, this initial approach relies on a manually defined reward and assumes a known, concise definition of the state and action space in order to sidestep DRL's issues of slow learning, low sample efficiency, and state/action specification.
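
As an illustration of how active and passive interaction signals might be combined into a scalar reward of this kind, consider the minimal sketch below. The weights, sensor groupings, and the estimate_engagement function are hypothetical placeholders, not the thesis's actual formulation.

```python
import numpy as np

def estimate_engagement(active_events, passive_events, w_active=0.7, w_passive=0.3):
    """Combine active interactions (e.g., triggers near actuators) and passive
    presence (e.g., motion detected in the space) into a scalar engagement
    estimate in [0, 1]. Weights and saturation are illustrative choices."""
    # Soft-saturate raw event counts so the estimate stays bounded.
    active = 1.0 - np.exp(-np.sum(active_events))
    passive = 1.0 - np.exp(-np.sum(passive_events))
    return w_active * active + w_passive * passive

# The scalar estimate can then serve directly as the RL reward for one time step:
reward = estimate_engagement(active_events=np.array([2, 0, 1]),
                             passive_events=np.array([5, 3]))
```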

To relax these restrictive assumptions, we first analyze the effect of multi-step methods on alleviating the overestimation problem in DRL and, building on Deep Deterministic Policy Gradient (DDPG), propose Multi-step DDPG (MDDPG) and Mixed Multi-step DDPG (MMDDPG). Empirically, we show that both MDDPG and MMDDPG are significantly less affected by the overestimation problem than vanilla DDPG, which results in better final performance and faster learning. Then, to handle Partially Observable Markov Decision Processes (POMDPs), we propose the Long Short-Term Memory-based Twin Delayed Deep Deterministic Policy Gradient (LSTM-TD3) by introducing a memory component into TD3, and we compare its performance with that of other DRL algorithms in both MDPs and POMDPs. Our results demonstrate the significant advantages of the memory component in addressing POMDPs, including the ability to handle missing and noisy observation data. We then investigate partial observability as a potential source of failure when applying DRL to robot control tasks, which can arise when researchers are not confident that the observation space fully represents the underlying state. We compare the performance of TD3, SAC, and PPO under various partial observability conditions and find that TD3 and SAC easily become stuck in local optima and underperform PPO. We propose multi-step versions of vanilla TD3 and SAC to improve their robustness to partial observability.
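
To make the multi-step idea concrete, the sketch below computes an n-step bootstrapped critic target of the kind a multi-step DDPG-style update would use in place of the one-step target. The q_target and pi_target callables and tensor shapes are generic assumptions for illustration, not the exact MDDPG/MMDDPG implementation.

```python
import torch

def n_step_td_target(rewards, next_state, done, q_target, pi_target, gamma=0.99):
    """n-step TD target:
        y = r_t + gamma * r_{t+1} + ... + gamma^{n-1} * r_{t+n-1}
            + gamma^n * Q'(s_{t+n}, mu'(s_{t+n}))
    `rewards` holds the n intermediate rewards of one transition window;
    `done` is 1.0 if the episode terminated within the window, else 0.0."""
    n = rewards.shape[0]
    discounts = gamma ** torch.arange(n, dtype=rewards.dtype)
    n_step_return = torch.sum(discounts * rewards)
    with torch.no_grad():
        # Bootstrap from the target critic evaluated at the target actor's action.
        bootstrap = q_target(next_state, pi_target(next_state)).squeeze()
    return n_step_return + (1.0 - done) * (gamma ** n) * bootstrap
```

Longer windows propagate reward information faster and dilute the bootstrapped (and potentially overestimated) Q-value, which is the intuition behind the reduced overestimation reported for MDDPG and MMDDPG.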

Building on our study with a manually designed reward function, namely the estimate of engagement, and on the fundamental DRL research above, we further reduce reliance on designers' domain knowledge and propose learning a reward function from human preferences over engaging behaviors by taking advantage of preference learning algorithms. Our simulation results show that the reward function induced from human preferences leads to a policy that generates engaging behavior.
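
A minimal sketch of how such a reward function can be fit from pairwise human preferences, using a Bradley-Terry style loss common in preference-based RL, is given below. The network architecture, segment encoding, and names are illustrative assumptions rather than the thesis's exact setup.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (state, action) pair to a scalar reward estimate."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_model, seg_a, seg_b, pref):
    """Bradley-Terry loss: the segment with the higher summed predicted reward
    should match the human label `pref` (1.0 if segment A was preferred)."""
    r_a = reward_model(*seg_a).sum(dim=-1)  # summed predicted reward over segment A
    r_b = reward_model(*seg_b).sum(dim=-1)  # summed predicted reward over segment B
    logits = r_a - r_b
    return nn.functional.binary_cross_entropy_with_logits(logits, pref)
```

Once trained, the learned reward model can replace the hand-designed engagement estimate as the reward signal for the DRL agent.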