# Policy Iterations for Reinforcement Learning Problems in Continuous Time and Space -- Fundamental Theory and Methods

**Authors:** Jaeyoung Lee, Richard S. Sutton

arXiv: 1705.03520 · 2021-04-06

## TL;DR

This paper introduces two policy iteration (PI) methods, differential PI (DPI) and integral PI (IPI), for reinforcement learning in continuous time and space. It provides their theoretical foundations and mathematical properties, case studies on discounted RL and optimal control, and simulations on an inverted-pendulum model.

## Contribution

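The paper develops and analyzes differential and integral policy iteration methods for continuous-time RL in environments modeled by ordinary differential equations. The methods extend classical PI ideas and give theoretical support to existing continuous-time algorithms, namely TD-learning and value-gradient-based greedy policy updates.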

## Key findings

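- Investigates, and improves on existing theory for, admissibility, uniqueness of the solution to the Bellman equation, monotone improvement, convergence, and optimality of the solution to the Hamilton-Jacobi-Bellman equation.
- Works through case studies on discounted RL and optimal control tasks.
- Simulates the methods on an inverted-pendulum model, using both model-based and partially model-free implementations.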
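
To make the evaluate-then-improve loop concrete, here is a minimal sketch of an integral-style policy iteration on a toy problem. It is not the paper's algorithm or its inverted-pendulum experiment. It assumes a scalar linear system `x' = a*x + b*u`, a reward `-(q*x**2 + r*u**2)`, linear policies `u = -k*x`, and a value function of the form `v(x) = -p*x**2`. All function and parameter names are hypothetical. The evaluation step uses only sampled states and rewards along a trajectory, while the improvement step uses the input gain `b`, which is one way to read "partially model-free" in this simplified setting.

```python
import math


def simulate(a, b, k, x0, horizon, steps):
    """Stand-in for the environment: sample x' = (a - b*k) * x on a uniform grid."""
    dt = horizon / steps
    rate = a - b * k  # closed-loop rate; must be negative for an admissible policy
    return [x0 * math.exp(rate * i * dt) for i in range(steps + 1)], dt


def evaluate_policy(xs, dt, k, q, r):
    """Integral policy evaluation for v(x) = -p * x**2 from trajectory data.

    Solves p * (x(0)**2 - x(T)**2) = integral of (q + r*k**2) * x**2 over [0, T]
    without using the drift coefficient a.
    """
    costs = [(q + r * k * k) * x * x for x in xs]
    integral = dt * (sum(costs) - 0.5 * (costs[0] + costs[-1]))  # trapezoidal rule
    return integral / (xs[0] ** 2 - xs[-1] ** 2)


def integral_policy_iteration(a, b, q, r, k0, iterations=10):
    """Alternate integral evaluation and greedy improvement; k0 must be stabilizing."""
    k = k0
    history = []
    for _ in range(iterations):
        xs, dt = simulate(a, b, k, x0=1.0, horizon=0.5, steps=2000)
        p = evaluate_policy(xs, dt, k, q, r)
        k = b * p / r  # greedy update: maximizes reward + v'(x) * (a*x + b*u) over u
        history.append(p)
    return k, history


# Assumed toy parameters (an unstable open-loop system with a stabilizing initial gain).
a, b, q, r = 1.0, 1.0, 1.0, 1.0
k, history = integral_policy_iteration(a, b, q, r, k0=3.0)

# Closed-form solution of the scalar Riccati equation, used only as a reference.
p_star = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
```

In this toy setting the cost coefficients in `history` decrease toward `p_star`, which mirrors the monotone improvement and convergence properties listed above, under the assumption that the initial gain `k0` is stabilizing.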

## Abstract

Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making/control problem, or in other words, a reinforcement learning (RL) problem. PI has also served as the foundation for developing RL methods. In this paper, we propose two PI methods, called differential PI (DPI) and integral PI (IPI), and their variants, for a general RL framework in continuous time and space (CTS), where the environment is modeled by a system of ordinary differential equations (ODEs). The proposed methods inherit the current ideas of PI in classical RL and optimal control and theoretically support the existing RL algorithms in CTS: TD-learning and value-gradient-based (VGB) greedy policy update. We also provide case studies including 1) discounted RL and 2) optimal control tasks. Fundamental mathematical properties -- admissibility, uniqueness of the solution to the Bellman equation (BE), monotone improvement, convergence, and optimality of the solution to the Hamilton-Jacobi-Bellman equation (HJBE) -- are all investigated in depth and improved from the existing theory, for both the general framework and the case studies. Finally, the proposed methods are simulated with an inverted-pendulum model, using model-based and partially model-free implementations, to support the theory and to investigate the methods further.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1705.03520/full.md

## Figures

23 figures with captions in the complete paper: https://tomesphere.com/paper/1705.03520/full.md

## References

78 references — full list in the complete paper: https://tomesphere.com/paper/1705.03520/full.md

---
Source: https://tomesphere.com/paper/1705.03520