Policy iterations for reinforcement learning problems in continuous time and space – Fundamental theory and methods

Title Policy iterations for reinforcement learning problems in continuous time and space – Fundamental theory and methods
Abstract

Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making/control problem, in other words, a reinforcement learning (RL) problem. PI has also served as a foundation for developing RL methods. In this paper, we propose two PI methods, called differential PI (DPI) and integral PI (IPI), and their variants, for a general RL framework in continuous time and space (CTS), where the environment is modeled by a system of ordinary differential equations (ODEs). The proposed methods inherit the current ideas of PI in classical RL and optimal control and theoretically support the existing RL algorithms in CTS: TD-learning and value-gradient-based (VGB) greedy policy update. We also provide case studies on (1) discounted RL and (2) optimal control tasks. Fundamental mathematical properties – admissibility, uniqueness of the solution to the Bellman equation (BE), monotone improvement, convergence, and optimality of the solution to the Hamilton–Jacobi–Bellman equation (HJBE) – are investigated in depth and improved over the existing theory, for both the general framework and the case studies. Finally, the proposed methods are simulated with an inverted-pendulum model, in model-based and partially model-free implementations, to support the theory and investigate the methods further.
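The evaluate/improve cycle described in the abstract can be illustrated with a minimal sketch. This is not the paper's DPI or IPI algorithm; it assumes a classical scalar linear-quadratic problem (dx/dt = a·x + b·u, quadratic cost), where policy evaluation reduces to a scalar Lyapunov/Bellman equation and the greedy (value-gradient-based) improvement has a closed form, so the iterates can be seen converging to the Riccati (HJB) solution:

```python
# Hypothetical illustration of continuous-time policy iteration on a
# scalar LQ problem: dx/dt = a*x + b*u, cost = integral(q*x^2 + r*u^2) dt,
# linear policy u = -k*x, quadratic value V(x) = p*x^2.

def evaluate_policy(k, a=1.0, b=1.0, q=1.0, r=1.0):
    """Policy evaluation: solve the scalar Lyapunov (Bellman) equation
    2*(a - b*k)*p + q + r*k**2 = 0 for the value coefficient p.
    Requires an admissible (stabilizing) gain, i.e. b*k > a."""
    return (q + r * k**2) / (2.0 * (b * k - a))

def improve_policy(p, b=1.0, r=1.0):
    """Greedy (value-gradient-based) improvement: u = -(b*p/r)*x."""
    return b * p / r

def policy_iteration(k0=2.0, iters=20):
    """Alternate evaluation and improvement from an admissible gain k0."""
    k = k0
    for _ in range(iters):
        p = evaluate_policy(k)   # value of the current policy
        k = improve_policy(p)    # greedy update from the value gradient
    return p, k

p_star, k_star = policy_iteration()
# With a = b = q = r = 1, the algebraic Riccati equation 2p - p**2 + 1 = 0
# has the stabilizing solution p* = 1 + sqrt(2); the p-iterates decrease
# monotonically to it, mirroring the monotone-improvement property.
```

Starting from any admissible gain (here k0 = 2), the sequence of value coefficients is monotonically non-increasing and converges to the Riccati solution, a scalar instance of the monotone improvement and convergence properties the abstract lists.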

Year of Publication
2021
Journal
Automatica
Volume
126
Number of Pages
Article 109421, 15 pages
Date Published
04/2021
ISSN Number
0005-1098
DOI
10.1016/j.automatica.2020.109421