Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning
Franki Nguimatsia Tiofack, Théotime Le Hellard, Fabian Schramm, Nicolas Perrin-Gilbert, Justin Carpentier

TL;DR
Guided Flow Policy (GFP) improves offline reinforcement learning by focusing on high-value actions through a coupled flow-matching and actor approach, leading to state-of-the-art results across diverse benchmarks.
Contribution
GFP introduces a novel coupling of flow-matching and actor models to prioritize high-value actions, enhancing performance in offline RL tasks.
Findings
Achieves state-of-the-art results on 144 tasks from the OGBench, Minari, and D4RL benchmarks.
Substantially improves performance on suboptimal datasets.
Effective in both state-based and pixel-based tasks.
Abstract
Offline reinforcement learning often relies on behavior regularization that enforces policies to remain close to the dataset distribution. However, such approaches fail to distinguish between high-value and low-value actions in their regularization components. We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset rather than indiscriminately imitating all state-action pairs. In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state and pixel-based tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal…
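The mutual guidance described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the exponential weighting in `value_weights`, the linear noise-to-action path, the squared-error distillation term, and all names and parameters (`temperature`, `max_weight`, `alpha`, `velocity_fn`) are assumptions made for illustration only.

```python
import numpy as np


def value_weights(q_data, q_actor, temperature=1.0, max_weight=100.0):
    """Hypothetical weighting (an assumption, not the paper's formula):
    up-weight dataset actions whose critic value is high relative to the
    value of the actor's action in the same state."""
    adv = (q_data - q_actor) / temperature
    # Clip in log-space so the exponential cannot overflow.
    return np.exp(np.minimum(adv, np.log(max_weight)))


def weighted_flow_matching_loss(velocity_fn, states, actions, weights, rng):
    """Weighted behavior cloning for the flow policy, written as a
    conditional flow-matching loss on a linear noise-to-action path."""
    noise = rng.standard_normal(actions.shape)
    t = rng.uniform(size=(actions.shape[0], 1))
    x_t = (1.0 - t) * noise + t * actions   # point on the straight path
    target = actions - noise                # velocity of that path
    pred = velocity_fn(states, x_t, t)
    per_sample = np.sum((pred - target) ** 2, axis=1)
    return float(np.mean(weights * per_sample))


def actor_loss(q_actor, actor_actions, flow_actions, alpha=1.0):
    """One-step actor objective: maximize the critic while staying close
    to actions sampled from the flow policy (assumed squared-error term)."""
    distill = np.sum((actor_actions - flow_actions) ** 2, axis=1)
    return float(np.mean(-q_actor + alpha * distill))
```

In this sketch, transitions whose dataset action scores well under the critic dominate the flow-matching loss, while `alpha` trades off critic maximization against staying near the flow policy's samples.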
Peer Reviews
Decision: ICLR 2026 Poster
The method is elegant and easy to follow. It cleanly integrates flow matching into the BRAC framework: a value-aware flow policy shapes the action distribution, and a distilled one-step actor provides fast inference. The bidirectional guidance narrative is intuitive, and the overall training loop is straightforward to implement. The empirical evaluation is thorough. The paper reports results on 129 tasks spanning OGBench, Minari, and D4RL, providing broad coverage and reducing the risk of cherry
First, the novelty is moderate. The method still fits squarely within the BRAC family: it combines an expressive flow-matching policy with value-aware cloning and then distills into a one-step actor. While the synthesis is thoughtful and practically useful, the conceptual step beyond prior behavior-regularization plus expressive policy models feels incremental. Second, the paper’s structure could be clearer, especially in the Experiments section. Main results and analysis are interleaved, which
This paper presents a clean framework with empirically sound results. The idea of using the AWR weighting to train the flow network is a neat way to achieve policy improvement with respect to a learned Q function. It is computationally useful that the TD errors for Q-learning can be calculated using the distilled one-step policy, avoiding the need for an expensive ODE integration during training. In terms of clarity and quality, the paper reads well and is easy to follow. See weaknesses section for additi
As this framework in essence uses two methods of feedback from the Q function, the paper would greatly benefit from a clear analysis of the effects of both of these terms. For example, two clear baselines are the same method where either the flow policy is not weighted (or if this is exactly FQL, some text making this clear) and where the one-step policy does not utilize the DDPG loss. Building on the above, an analysis as to why the two feedback methods are complementary would take this paper
- The paper considers a comprehensive set of benchmark tasks, which makes the comparison between the proposed method and the baselines very convincing. I also appreciate that the authors seem to properly tune the hyperparameters for the baselines, which makes the comparisons fair.
- The proposed method is presented in a concise and clear manner and the algorithmic design has no technical issues.
The paper makes some factually questionable claims:
- Table 1: all prior methods (IQL, TD3+BC, ReBRAC, FQL) are listed under "handles suboptimal data (x)", which implies that these methods cannot handle suboptimal data. This is misleading because most offline RL methods listed, and in fact most offline RL methods in general, can handle suboptimal data. It would be good to further clarify and be precise about what "handles suboptimal data (x)" actually means here.
- The authors introduce Value-aware b
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
