A Pontryagin Perspective on Reinforcement Learning
Onno Eberhard, Claire Vernade, Michael Muehlebach

TL;DR
This paper introduces a novel open-loop reinforcement learning paradigm based on Pontryagin's principle, offering new algorithms with convergence guarantees that outperform existing methods on control tasks.
Contribution
It presents the first open-loop RL algorithms grounded in Pontryagin's principle, diverging from traditional Bellman-based methods, with theoretical guarantees and empirical success.
Findings
Algorithms outperform baselines on control tasks
Convergence guarantees established for all methods
Effective in high-dimensional MuJoCo environments
Abstract
Reinforcement learning has traditionally focused on learning state-dependent policies to solve optimal control problems in a closed-loop fashion. In this work, we introduce the paradigm of open-loop reinforcement learning where a fixed action sequence is learned instead. We present three new algorithms: one robust model-based method and two sample-efficient model-free methods. Rather than basing our algorithms on Bellman's equation from dynamic programming, our work builds on Pontryagin's principle from the theory of open-loop optimal control. We provide convergence guarantees and evaluate all methods empirically on a pendulum swing-up task, as well as on two high-dimensional MuJoCo tasks, significantly outperforming existing baselines.
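To make the open-loop idea concrete, the following minimal sketch computes the gradient of a total cost with respect to a fixed action sequence using a backward costate (adjoint) recursion, which is the discrete-time mechanism behind Pontryagin-style methods, and then runs plain gradient descent on the actions. It is an illustration only, not the paper's algorithms: the scalar linear dynamics, the quadratic cost, the assumption that the model is known exactly, and all names (`A`, `B`, `Q`, `R`, `rollout`, `costate_gradient`, `optimize`) are hypothetical choices made for this example.

```python
# Sketch: open-loop gradient via a costate (adjoint) recursion on a toy
# scalar system. Dynamics, cost, and names are illustrative assumptions,
# not the paper's algorithms.

A, B = 0.9, 0.5   # assumed known linear dynamics: x_next = A*x + B*u
Q, R = 1.0, 0.1   # assumed quadratic cost weights on state and action


def rollout(x0, actions):
    """Apply a fixed action sequence open loop; return the visited states."""
    states = [x0]
    for u in actions:
        states.append(A * states[-1] + B * u)
    return states


def total_cost(x0, actions):
    """Running cost Q*x^2 + R*u^2 plus terminal cost Q*x_T^2."""
    states = rollout(x0, actions)
    running = sum(Q * x * x + R * u * u for x, u in zip(states[:-1], actions))
    return running + Q * states[-1] ** 2


def costate_gradient(x0, actions):
    """Gradient of total_cost w.r.t. each action via one backward pass."""
    states = rollout(x0, actions)
    lam = 2.0 * Q * states[-1]                # terminal costate
    grad = [0.0] * len(actions)
    for t in reversed(range(len(actions))):
        # lam currently holds the costate at time t + 1
        grad[t] = 2.0 * R * actions[t] + B * lam
        lam = 2.0 * Q * states[t] + A * lam   # propagate costate to time t
    return grad


def optimize(x0, horizon=10, step=0.02, iters=500):
    """Gradient descent on the action sequence itself (no state feedback)."""
    actions = [0.0] * horizon
    for _ in range(iters):
        grad = costate_gradient(x0, actions)
        actions = [u - step * g for u, g in zip(actions, grad)]
    return actions
```

Calling `optimize(1.0)` returns a list of ten actions that can be replayed without observing the state. In a model-free setting the Jacobians `A` and `B` would not be available and would have to be estimated from rollouts, which this sketch does not attempt.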
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The paper is well written and connects control theory to existing RL concepts very well. The authors also analyze the efficacy of their methods using a toy example and do a great job explaining it. They provide both model-free and model-based algorithms using Pontryagin’s principle.
Major issues: I see three major issues. **Comparison to previous work:** I believe there are similarities to PILCO [1] in terms of how you estimate the Jacobian of the dynamics (lines 125-132). This is a fairly popular model-based method in RL and I believe you could compare against it [1,2]. Moreover, you assume deterministic dynamics whereas PILCO is more general purpose with stochastic dynamics. I believe there is some analogue of the Jacobian estimate in PILCO for deterministic dynamics that
It is interesting how the authors bridge differential dynamic programming into RL, bringing a new perspective to policy learning. The presentation of the paper is good, easy to follow, and in general nicely written. The theoretical analysis is nicely done, with satisfactory results, and validated through numerical experiments.
Novelty: In light of the literature in trajectory optimization, specifically Differential Dynamic Programming (DDP), the content of this work has already been studied extensively, e.g. [1]. The elements in the proposed method seem neither new (using samples to approximate the transition probabilities, and solving a finite-horizon deterministic control problem with a first-order method such as gradient descent) nor particularly novel, as either 1) the control community has extensively studied effec
1. The paper contributes to reinforcement learning by addressing the relatively underexplored area of open-loop RL, which has potential benefits in environments with deterministic or predictable dynamics. 2. The authors provide a convergence proof for their proposed algorithms, enhancing the theoretical rigor and assuring readers of the stability of the methods under specified conditions.
1. The introduction and related work sections would benefit from a more comprehensive review of existing research on open-loop control and open-loop reinforcement learning, especially high-impact studies. By not including these, the paper risks underselling the importance of its contributions in the context of current research. Citing influential works would help position this study within the broader field and underscore the significance of open-loop RL. 2. The paper introduces both open-loop
The authors approach the decision-making problem from a relatively novel perspective and design multiple feasible algorithms and experiments within this framework, demonstrating the advantages of OLRL over closed-loop RL.
1. Aside from the benefits of reduced sensor and computation costs, I still struggle to understand the practical advantages of OLRL compared to closed-loop RL. I also have questions about certain statements made in the introduction. For instance, in line 49, the authors claim that OLRL is more robust if the environment changes. However, in unforeseen environment changes, it's difficult to determine whether OLRL or closed-loop RL would perform better. Furthermore, if the environmental change was
Taxonomy
Topics: Innovation Diffusion and Forecasting
