Boosting Maximum Entropy Reinforcement Learning via One-Step Flow Matching
Zeqiao Li, Yijing Wang, Haoyu Wang, Zheng Li, Zhiqiang Zuo

TL;DR
This paper introduces FLAME, a new framework that enhances maximum entropy reinforcement learning with one-step flow matching, achieving expressive policies with lower inference latency and improved exploration.
Contribution
FLAME develops a Q-Reweighted flow matching objective and a bias-corrected entropy estimator, and integrates MeanFlow for efficient one-step control in MaxEnt RL.
Findings
FLAME outperforms Gaussian baselines on MuJoCo tasks.
FLAME matches multi-step diffusion policies with lower inference cost.
The proposed methods improve exploration and policy expressiveness.
Abstract
Diffusion policies are expressive yet incur high inference latency. Flow Matching (FM) enables one-step generation, but integrating it into Maximum Entropy Reinforcement Learning (MaxEnt RL) is challenging: the optimal policy is an intractable energy-based distribution, and the efficient log-likelihood estimation required to balance exploration and exploitation suffers from severe discretization bias. We propose Flow-based Log-likelihood-Aware Maximum Entropy RL (FLAME), a principled framework that addresses these challenges. First, we derive a Q-Reweighted FM objective that bypasses partition function estimation via importance reweighting. Second, we design a decoupled entropy estimator that rigorously corrects bias, which enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Third, we integrate…
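The idea of avoiding the partition function through importance reweighting can be illustrated with a minimal sketch. It is not the paper's objective, whose exact form is not given in this abstract. It assumes the standard MaxEnt target pi*(a|s) proportional to exp(Q(s,a)/alpha) and uses self-normalized weights over a batch of candidate actions, so the unknown normalizer cancels. The function name `q_reweighted_fm_loss`, its arguments, and the squared-error velocity regression are all hypothetical choices made for illustration.

```python
import numpy as np

def q_reweighted_fm_loss(q_values, v_pred, v_target, alpha=1.0):
    """Illustrative Q-weighted flow matching loss for a single state.

    q_values: (N,) critic values Q(s, a_i) for N candidate actions
    v_pred:   (N, D) velocities predicted by the policy network
    v_target: (N, D) regression targets for those velocities
    alpha:    temperature of the assumed target exp(Q / alpha)
    """
    logits = q_values / alpha
    logits = logits - logits.max()        # subtract max for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()     # normalizing over samples cancels the partition function
    per_sample = np.sum((v_pred - v_target) ** 2, axis=-1)
    return float(np.sum(weights * per_sample)), weights

# Two candidate actions with equal Q-values receive equal weight.
loss, w = q_reweighted_fm_loss(
    np.array([0.0, 0.0]), np.zeros((2, 1)), np.array([[1.0], [3.0]])
)
```

Because the weights depend only on differences between Q-values, adding a constant to every Q-value leaves them unchanged, which is why no explicit estimate of the normalizer is needed in this sketch.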
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
