One-Step Flow Q-Learning: Addressing the Diffusion Policy Bottleneck in Offline Reinforcement Learning
Thanh Nguyen, Chang D. Yoo

TL;DR
This paper introduces OFQL, an offline reinforcement learning method that generates actions in a single step, improving speed and robustness over multi-step Diffusion Q-Learning (DQL) while achieving state-of-the-art results on D4RL.
Contribution
OFQL reformulates diffusion Q-learning within the Flow Matching framework to enable direct one-step action generation without auxiliary modules or distillation.
Findings
OFQL outperforms multi-step DQL on D4RL benchmark tasks.
OFQL reduces computation during training and inference.
OFQL achieves state-of-the-art performance on D4RL.
Abstract
Diffusion Q-Learning (DQL) has established diffusion policies as a high-performing paradigm for offline reinforcement learning, but its reliance on multi-step denoising for action generation renders both training and inference slow and fragile. Existing efforts to accelerate DQL toward one-step denoising typically rely on auxiliary modules or policy distillation, sacrificing either simplicity or performance. It remains unclear whether a one-step policy can be trained directly without such trade-offs. To this end, we introduce One-Step Flow Q-Learning (OFQL), a novel framework that enables effective one-step action generation during both training and inference, without auxiliary modules or distillation. OFQL reformulates the DQL policy within the Flow Matching (FM) paradigm but departs from conventional FM by learning an average velocity field that directly supports accurate one-step…
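The distinction the abstract draws between a conventional (instantaneous) velocity field and an average velocity field can be illustrated on a toy problem. The sketch below is not the paper's implementation. It assumes a 1-D Gaussian action distribution, a linear noise-to-data path with `t = 1` as noise and `t = 0` as data, and closed-form fields in place of learned networks. The names `MU`, `SIGMA`, `flow_map`, `instantaneous_velocity`, and `average_velocity` are hypothetical. Under these assumptions, a single step with the average velocity reproduces the target distribution, while a single Euler step with the instantaneous velocity collapses every sample to the mean.

```python
import numpy as np

# Toy 1-D setting (an assumption for illustration, not the paper's setup):
# actions a ~ N(MU, SIGMA^2), noise eps ~ N(0, 1),
# linear path z_t = (1 - t) * a + t * eps, so t = 1 is noise and t = 0 is data.
MU, SIGMA = 2.0, 0.5


def mean_std(t):
    """Mean and standard deviation of the Gaussian marginal of z_t."""
    m = (1.0 - t) * MU
    s = np.sqrt(((1.0 - t) * SIGMA) ** 2 + t ** 2)
    return m, s


def flow_map(z_t, t, r):
    """Exact ODE transport from time t to time r (closed form for Gaussians)."""
    m_t, s_t = mean_std(t)
    m_r, s_r = mean_std(r)
    return m_r + s_r * (z_t - m_t) / s_t


def instantaneous_velocity(z_t, t):
    """Marginal velocity v(z, t) = E[eps - a | z_t = z] for this toy problem."""
    m_t, s_t = mean_std(t)
    return -MU + (t - (1.0 - t) * SIGMA ** 2) / s_t ** 2 * (z_t - m_t)


def average_velocity(z_t, r, t):
    """u(z_t, r, t): displacement between times r and t over the elapsed time."""
    return (z_t - flow_map(z_t, t, r)) / (t - r)


rng = np.random.default_rng(0)
eps = rng.standard_normal(100_000)

# One step from noise (t = 1) to data (t = 0) with each field.
one_step_u = eps - average_velocity(eps, 0.0, 1.0)
one_step_v = eps - instantaneous_velocity(eps, 1.0)

print("average velocity:       mean, std =", one_step_u.mean(), one_step_u.std())
print("instantaneous velocity: mean, std =", one_step_v.mean(), one_step_v.std())
```

In a learned setting both fields would be neural networks conditioned on the state, and the closed-form `flow_map` would not be available; the toy only shows why the quantity being regressed matters for one-step sampling.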
Peer Reviews
Decision: ICLR 2026 Poster
The method is simple, clear, and effective. By replacing only the diffusion policy component with the mean-flow policy, the approach achieves both higher sampling efficiency and competitive performance. The toy example nicely illustrates the advantage of reparameterizing from $v$ to $u$, providing a clearer intuition for the underlying mechanism.
Given that mean-flow generative modeling has already shown strong one-step FID results on image generation tasks, it would be valuable to see this approach applied to more complex environments beyond D4RL, such as robotic control or high-dimensional decision-making settings.
* **Clear conceptual advancement:** Reformulating DQL under the flow-matching framework and introducing an average velocity field is a novel and elegant idea that directly addresses the core inefficiency of multi-step denoising.
* **Simplicity and effectiveness:** Unlike prior one-step approaches that depend on auxiliary modules or policy distillation, OFQL remains conceptually clean while achieving superior results.
* **Strong empirical results:** The method outperforms DQL and other diffusion-
* The theoretical justification for why learning an **average velocity field** leads to better one-step performance could be elaborated further. Currently, the paper provides an intuitive explanation but lacks a deeper analytical connection to diffusion dynamics.
1. The method shows empirical advantages in policy performance, training speed, and inference time.
2. The paper is easy to follow.

1. The proposed method lacks novelty. The only main difference between the proposed method and DQL is replacing the diffusion loss in actor training with a MeanFlow loss.
2. The experiments are not adequate. Only results on state-based D4RL tasks are included, and no visual observation task results are reported.
3. The argument in Lines 262-264 is not clear. Flow matching cannot "in principle, enable one-step generation", as the sampling trajectory is straight only when the target distribution i
Taxonomy
Topics: Cognitive Science and Education Research
