DiffOP: Reinforcement Learning of Optimization-Based Control Policies via Implicit Policy Gradients

Yuexin Bian; Jie Feng; Yuanyuan Shi

arXiv:2411.07484 · eess.SY · November 20, 2025

TL;DR

DiffOP is a reinforcement learning framework for control policies that are defined implicitly as the solution of an optimization problem. It optimizes the actual control cost directly with policy gradients, without value function approximation, comes with a convergence guarantee under standard regularity conditions, and is effective on nonlinear control tasks.

Contribution

DiffOP learns implicitly defined, optimization-based control policies via policy gradients whose derivatives are computed analytically, avoiding value function approximation.

Findings

1. Converges to an $\epsilon$-stationary point within $\epsilon^{-1}$ iterations.

2. Effective on nonlinear control and power system voltage control tasks.

3. Provides analytical policy gradients through implicit differentiation.
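
As a rough illustration of the third finding, the implicit-differentiation step can be sketched in generic notation. The notation is assumed here and is not taken from the paper. Suppose the policy output $u^\star(\theta)$ minimizes a twice-differentiable objective $f(u, \theta)$ with no active constraints, the Hessian $\nabla^2_{uu} f$ is nonsingular at the solution, and the control cost $C$ depends on $\theta$ only through $u^\star$. Differentiating the stationarity condition then gives:

$$
\nabla_u f\big(u^\star(\theta), \theta\big) = 0
\;\Longrightarrow\;
\frac{\partial u^\star}{\partial \theta}
= -\left[\nabla^2_{uu} f\right]^{-1} \nabla^2_{u\theta} f,
\qquad
\nabla_\theta C
= \left(\frac{\partial u^\star}{\partial \theta}\right)^{\!\top} \nabla_u C .
$$

With constraints, the same argument would be applied to the KKT conditions rather than to plain stationarity. How the paper combines this sensitivity with the policy gradient over full trajectories is not shown in this listing.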

Abstract

Real-world control systems require policies that are not only high-performing but also interpretable and robust. A promising direction toward this goal is model-based control, which learns system dynamics and cost functions from historical data and then uses these models to inform decision-making. Building on this paradigm, we introduce DiffOP, a novel framework for learning optimization-based control policies defined implicitly through optimization control problems. Without relying on value function approximation, DiffOP jointly learns the cost and dynamics models and directly optimizes the actual control costs using policy gradients. To enable this, we derive analytical policy gradients by applying implicit differentiation to the underlying optimization problem and integrating it with the standard policy gradient framework. Under standard regularity conditions, we establish that…
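
The idea of differentiating through an optimization-defined policy can be illustrated with a minimal sketch. It is not the authors' algorithm: the one-step toy problem, the matrices `Q` and `B`, the target `U_TARGET`, and all function names are hypothetical, and the inner problem is unconstrained and deterministic. The sketch solves the inner problem with Newton's method, obtains the gradient of an outer cost from the implicit function theorem, and compares it with finite differences.

```python
import numpy as np

# Hypothetical toy problem (not from the paper). The "policy" returns
#   u*(theta) = argmin_u 0.5 u^T Q u + 0.25 * sum(u^4) - theta^T B u
# and the outer "control cost" is 0.5 * ||u*(theta) - U_TARGET||^2.
Q = np.array([[2.0, 0.5], [0.5, 1.0]])                # positive definite
B = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, -1.0]])   # theta in R^3, u in R^2
U_TARGET = np.array([0.5, -0.5])


def solve_inner(theta, iters=30):
    """Newton's method on the stationarity condition g(u, theta) = 0."""
    u = np.zeros(2)
    for _ in range(iters):
        g = Q @ u + u**3 - B.T @ theta    # gradient of the inner objective in u
        H = Q + np.diag(3.0 * u**2)       # Hessian of the inner objective in u
        u = u - np.linalg.solve(H, g)
    return u


def outer_cost(theta):
    u = solve_inner(theta)
    return 0.5 * np.sum((u - U_TARGET) ** 2)


def implicit_gradient(theta):
    """Gradient of outer_cost via the implicit function theorem."""
    u = solve_inner(theta)
    H = Q + np.diag(3.0 * u**2)
    # dg/dtheta = -B^T, so du/dtheta = -H^{-1} (dg/dtheta) = H^{-1} B^T (2 x 3).
    du_dtheta = np.linalg.solve(H, B.T)
    # Chain rule through the outer cost.
    return du_dtheta.T @ (u - U_TARGET)


def finite_difference_gradient(theta, eps=1e-6):
    """Central differences, used only as a numerical check."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (outer_cost(theta + step) - outer_cost(theta - step)) / (2 * eps)
    return grad


theta0 = np.array([0.3, -0.2, 0.5])
grad_implicit = implicit_gradient(theta0)
grad_fd = finite_difference_gradient(theta0)
print("implicit:", grad_implicit)
print("finite difference:", grad_fd)
```

The implicit gradient needs one linear solve at the optimum instead of differentiating through every Newton iteration, which is the practical appeal of this approach. A trajectory-level policy gradient with learned cost and dynamics, as described in the abstract, would involve considerably more structure than this one-step example.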

Taxonomy

Topics: Advanced Control Systems Optimization