One-Step Generative Policies with Q-Learning: A Reformulation of MeanFlow
Zeyuan Wang, Da Li, Yulin Chen, Ye Shi, Liang Bai, Tianyuan Yu, Yanwei Fu

TL;DR
This paper presents a novel one-step generative policy for offline reinforcement learning that directly maps noise to actions using a reformulated MeanFlow, enabling expressive, stable, and efficient Q-learning-based policy training.
Contribution
It introduces a residual reformulation of MeanFlow that allows direct noise-to-action generation, simplifying training and improving expressivity in offline RL.
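To make the idea of direct noise-to-action generation concrete, here is a minimal sketch in Python with NumPy. This summary does not give the exact residual form, so the sketch assumes it is g(z, r, t) = z - u(z, r, t), where u is the MeanFlow average velocity. The names `u_net`, `g_net`, `sample_via_velocity`, and `sample_direct` are hypothetical, the weights are random rather than trained, and state conditioning is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM = 4
# Toy, untrained weights: inputs are the action-sized vector plus the two times (r, t).
W = rng.normal(scale=0.1, size=(ACTION_DIM + 2, ACTION_DIM))


def u_net(z, r, t):
    """Hypothetical stand-in for a learned average-velocity network u(z, r, t)."""
    times = np.broadcast_to(np.array([r, t]), (z.shape[0], 2))
    return np.tanh(np.concatenate([z, times], axis=1) @ W)


def sample_via_velocity(noise):
    """MeanFlow-style one-step sampling: z_r = z_t - (t - r) * u(z_t, r, t), with r=0, t=1."""
    r, t = 0.0, 1.0
    return noise - (t - r) * u_net(noise, r, t)


def g_net(z, r, t):
    """Assumed residual form: the policy outputs the action itself, g = z - u.

    In a real policy g would be its own trained network; it is built from
    u_net here only so the two sampling routes can be compared.
    """
    return z - u_net(z, r, t)


def sample_direct(noise):
    """One forward pass maps noise to an action, with no separate velocity estimate."""
    return g_net(noise, 0.0, 1.0)


noise = rng.normal(size=(8, ACTION_DIM))
actions = sample_direct(noise)
```

Under this assumption the two routes agree at r = 0, t = 1. The practical difference is that the direct route exposes the action as the network output, which is the quantity a Q-function would score during policy improvement.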
Findings
Achieves strong performance on 73 tasks across OGBench and D4RL benchmarks.
Supports multimodal action distributions with stable learning.
Enables single-stage training for offline and offline-to-online RL.
Abstract
We introduce a one-step generative policy for offline reinforcement learning that maps noise directly to actions via a residual reformulation of MeanFlow, making it compatible with Q-learning. While one-step Gaussian policies enable fast inference, they struggle to capture complex, multimodal action distributions. Existing flow-based methods improve expressivity but typically rely on distillation and two-stage training when trained with Q-learning. To overcome these limitations, we propose to reformulate MeanFlow to enable direct noise-to-action generation by integrating the velocity field and noise-to-action transformation into a single policy network, eliminating the need for separate velocity estimation. We explore several reformulation variants and identify an effective residual formulation that supports expressive and stable policy learning. Our method offers three key advantages:…
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Robot Manipulation and Learning
