Uniformly Constrained Reinforcement Learning

Title Uniformly Constrained Reinforcement Learning

We propose new multi-objective reinforcement learning algorithms that aim to find a globally Pareto-optimal deterministic policy that uniformly (in all states) maximizes a reward subject to a uniform probabilistic constraint over reaching forbidden states of a Markov decision process. Our requirements arise naturally in the context of safety-critical systems, but pose a significant unmet challenge. This class of learning problem is known to be hard and there are no off-the-shelf solutions that fully address the combined requirements of determinism and uniform optimality.

Having formalized our requirements and highlighted the specific challenge of learning instability, using a simple counterexample, we define from first principles a stable Bellman operator that we prove partially respects our requirements. This operator is therefore a partial solution to our problem, but produces conservative polices in comparison to our previous approach, which was not designed to satisfy the same requirements. We thus propose a relaxation of the stable operator, using\ adaptive hysteresis, that forms the basis of a heuristic approach that is stable w.r.t. our counterexample and learns policies that are less conservative than those of the stable operator and our previous algorithm. In comparison to our previous approach, the policies of our adaptive hysteresis algorithm demonstrate improved monotonicity with increasing constraint probabilities, which is one of the characteristics we desire. We demonstrate that adaptive hysteresis works well with dynamic programming and reinforcement learning, and can be adapted to function approximation.

Year of Publication
Accepted for publication in Journal of Autonomous Agents and Multi-Agent Systems (JAAMAS): Special Issue on Multi-Objective Decision Making (MODeM)
Download citation