Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning
Abdelghani Ghanem, Mounir Ghogho

TL;DR
This paper introduces ME-AM, a novel offline RL framework that enhances policy expressivity and exploration by integrating entropy maximization and a mixture behavior prior within a flow-matching model.
Contribution
It proposes a unified approach combining entropy regularization and a mixture prior to overcome support and bias limitations in flow-based offline RL methods.
Findings
ME-AM outperforms existing methods on sparse-reward continuous control tasks.
The entropy mechanism reduces popularity bias, enabling better policy extraction.
The mixture prior broadens support, improving exploration in out-of-distribution regions.
Abstract
Integrating expressive generative policies, such as flow-matching models, into offline reinforcement learning (RL) allows agents to capture complex, multi-modal behaviors. While Q-learning with Adjoint Matching (QAM) stabilizes policy optimization via the continuous adjoint method, it remains inherently bound to the fixed behavior distribution. This dependence induces a popularity bias that can suppress high-reward actions in low-density regions, and creates a support binding that restricts off-manifold exploration. Existing workarounds, such as appending residual Gaussian policies, often re-introduce the expressivity bottlenecks associated with unimodal distributions. In this work, we propose Maximum Entropy Adjoint Matching (ME-AM), a unified framework that addresses these limitations within the continuous flow formulation. ME-AM incorporates two…
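The popularity bias and support binding described in the abstract can be illustrated with a toy discrete example. The sketch below is not ME-AM or QAM; it uses the standard exponentially tilted form of a behavior-regularized policy, pi(a) proportional to prior(a) * exp(Q(a) / temperature), on three actions. The action densities, Q-values, the uniform mixture component, and the mixing weight of 0.5 are all assumptions chosen for illustration, and the helper names `tilted_policy` and `mix_with_uniform` are hypothetical.

```python
import numpy as np


def tilted_policy(prior, q_values, temperature=1.0):
    """Return pi(a) proportional to prior(a) * exp(Q(a) / temperature)."""
    prior = np.asarray(prior, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    # Subtracting the max Q is for numerical stability; normalization cancels it.
    weights = prior * np.exp((q_values - q_values.max()) / temperature)
    return weights / weights.sum()


def mix_with_uniform(prior, weight):
    """Blend a prior with a uniform distribution (an assumed mixture component)."""
    prior = np.asarray(prior, dtype=float)
    uniform = np.full_like(prior, 1.0 / prior.size)
    return (1.0 - weight) * prior + weight * uniform


# Assumed toy data: the highest-value action (index 2) is rare in the dataset.
behavior = np.array([0.90, 0.09, 0.01])
q_values = np.array([0.0, 0.5, 1.0])

# Popularity bias: the rare but best action keeps very little probability mass.
pi_behavior = tilted_policy(behavior, q_values)

# A broader prior lets the same Q-values shift more mass to the best action.
pi_mixture = tilted_policy(mix_with_uniform(behavior, 0.5), q_values)

# Support binding: an action with zero behavior density can never be selected,
# no matter how large its Q-value, unless the prior's support is widened.
unsupported = np.array([0.9, 0.1, 0.0])
pi_unsupported = tilted_policy(unsupported, q_values)
pi_unsupported_mix = tilted_policy(mix_with_uniform(unsupported, 0.5), q_values)

print("behavior prior:      ", pi_behavior)
print("mixture prior:       ", pi_mixture)
print("zero-support prior:  ", pi_unsupported)
print("zero-support + mix:  ", pi_unsupported_mix)
```

In this toy setting, the behavior-bound policy leaves most of its mass on the most frequent action even though it has the lowest value, while the mixture prior moves mass toward the high-value action and restores nonzero probability where the dataset had none. How ME-AM realizes these effects inside a continuous flow model is not shown here.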
