Yingluo Xun, Master’s candidate
David R. Cheriton School of Computer Science
In reinforcement learning, entropy-regularized value functions (in policy space) have attracted considerable attention recently because regularization smooths the value function and encourages exploration. However, existing methods leave a discrepancy between the regularized objective and the original objective. Because the policy depends directly on the value function in the reinforcement learning framework, this can in turn produce a discrepancy between the trained policy and the optimal policy.
Motivated by removing this discrepancy, we study the convergence behavior as the regularization parameter approaches 0. Our work makes two main contributions. First, we show that regularized value iteration converges in the model-based (tabular) case when the regularization (smoothing) parameter decreases at any rate. Second, in the model-free case, we express the optimization error as a tractable function of the smoothing parameter and propose an optimal rate of decrease for the regularization term that balances optimization error against smoothing bias.
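The tabular setting can be illustrated with a minimal sketch, given here only as an assumption-laden example and not as the method analyzed in the thesis. It assumes the common log-sum-exp form of the entropy-regularized Bellman backup, in which the hard maximum over actions is replaced by tau * log(sum(exp(Q / tau))), a quantity that exceeds the true maximum by at most tau * log(number of actions). The function names, the two-state toy MDP, and the schedule tau_k = 1 / (k + 1) are all hypothetical choices made for illustration.

```python
import math


def soft_max(values, tau):
    """Smoothed maximum tau * log(sum(exp(v / tau))); equals max(values) when tau == 0."""
    m = max(values)
    if tau <= 0.0:
        return m
    # Subtract the maximum before exponentiating for numerical stability.
    return m + tau * math.log(sum(math.exp((v - m) / tau) for v in values))


def regularized_value_iteration(P, R, gamma, tau_schedule, num_iters):
    """Tabular value iteration with a smoothed backup and a per-iteration temperature.

    P[s][a][t] is the probability of moving from state s to state t under action a,
    R[s][a] is the immediate reward, and tau_schedule(k) returns the smoothing
    parameter used at iteration k (an assumed interface, not taken from the source).
    """
    n_states = len(R)
    V = [0.0] * n_states
    for k in range(num_iters):
        tau = tau_schedule(k)
        V = [
            soft_max(
                [
                    R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_states))
                    for a in range(len(R[s]))
                ],
                tau,
            )
            for s in range(n_states)
        ]
    return V


# Hypothetical two-state, two-action MDP: action 0 stays, action 1 switches state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [
    [0.0, 1.0],
    [2.0, 0.0],
]

V = regularized_value_iteration(P, R, gamma=0.9, tau_schedule=lambda k: 1.0 / (k + 1), num_iters=500)
```

With a discount factor of 0.9, the unregularized optimal values of this toy problem are 19 for the first state and 20 for the second, so the returned `V` can be compared against them to see how the smoothing bias shrinks as the temperature decreases.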
200 University Avenue West
Waterloo, ON N2L 3G1