Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning

Prajwal Koirala; Cody Fleming

arXiv:2506.21427·cs.LG·February 26, 2026

Open Access 3 Reviews

TL;DR

The paper introduces SSCP, a flow-based policy that generates actions in one shot, combining the expressiveness of generative models with the efficiency of unimodal policies across offline, offline-to-online, and online RL settings.

Contribution

Proposes SSCP, a single-step completion policy trained with an augmented flow-matching objective for efficient, expressive, and scalable policy learning in reinforcement learning.

Findings

01. SSCP achieves faster inference than diffusion models.

02. SSCP performs well across offline, online, and offline-to-online RL.

03. SSCP effectively exploits subgoal structures in goal-conditioned RL.

Abstract

Generative models such as diffusion and flow-matching offer expressive policies for offline reinforcement learning (RL) by capturing rich, multimodal action distributions, but their iterative sampling introduces high inference costs and training instability due to gradient propagation across sampling steps. We propose the Single-Step Completion Policy (SSCP), a generative policy trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples, enabling accurate, one-shot action generation. In an off-policy actor-critic framework, SSCP combines the expressiveness of generative models with the training and inference efficiency of unimodal policies, without requiring long backpropagation chains. Our method scales effectively to offline, offline-to-online, and online RL settings, offering substantial gains in speed and adaptability over…
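The idea of predicting a "completion vector" from an intermediate flow sample can be illustrated with a small NumPy sketch. It assumes a straight-line interpolation path between Gaussian noise and a dataset action, which is a common flow-matching choice but is not stated in the abstract; the paper's exact parameterization and loss weighting may differ. The names `make_targets`, `one_shot_action`, and `completion_fn` are hypothetical and exist only for illustration.

```python
import numpy as np

def make_targets(actions, rng):
    """Build flow-matching and completion targets for a batch of dataset actions.

    Assumption: linear path x_t = (1 - t) * x_0 + t * x_1 with x_0 ~ N(0, I).
    """
    noise = rng.standard_normal(actions.shape)              # x_0, the flow's starting point
    t = rng.uniform(0.0, 1.0, size=(actions.shape[0], 1))   # one flow time per sample
    x_t = (1.0 - t) * noise + t * actions                   # intermediate flow sample
    velocity = actions - noise                              # usual flow-matching regression target
    completion = actions - x_t                              # vector that finishes the flow in one step
    return x_t, t, velocity, completion

def one_shot_action(completion_fn, state, rng, action_dim):
    """Generate an action with a single call to a (hypothetical) completion network.

    Starts from pure noise at t = 0 and adds the predicted completion vector.
    """
    z = rng.standard_normal((state.shape[0], action_dim))
    t0 = np.zeros((state.shape[0], 1))
    return z + completion_fn(state, z, t0)
```

Under this assumed path, the completion target equals `(1 - t) * velocity`, and both targets come directly from dataset actions rather than from the model's own bootstrapped predictions. A trained network would stand in for `completion_fn`; sampling then needs one forward pass instead of an iterative solver.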

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- This paper is well written and easy to follow.
- Stable training and efficient sampling are core concerns in diffusion policies in RL.
- The proposed method is flexible to different settings, like offline RL, online RL, and offline-to-online RL.

Weaknesses

- The Q update function (4) utilizes the standard TD error. However, various works in offline RL propose that there is an overestimation error of the Q function caused by the distribution shift. Thus, several works will choose conservative Q learning techniques like IQL. What about the performance of using IQL in the offline setting?
- What is the difference between online RL and offline RL when applying SSCP?
- In offline-to-online experiments like Fig.4 and Fig.5, it seems that online fine-t

Reviewer 02 · Rating 8 · Confidence 1

Strengths

- Big efficiency gains while maintaining or exceeding baseline performance.
- The paper is well-written and the method is well documented.
- Figure 8 in the appendix shows the strength of SSCP against shortcut models and makes the case that relying on bootstrap targets induces instability in the training. **I think that this figure is important since it clearly motivates the use of SSCP and thus would like to see it in the main text.**

Weaknesses

N/A

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. Novel and well-motivated approach: The completion vector formulation elegantly addresses a fundamental limitation of diffusion/flow policies (the need for iterative sampling) while maintaining expressiveness for multimodal action distributions. Unlike bootstrap-based shortcut methods [1], SSCP uses ground-truth targets from the dataset, avoiding early training instability.
2. A significant practical advantage is that SSCP enables training generative policies without backpropagating through iter

Weaknesses

1. While the paper compares against FQL [Park et al., 2025], there are other recent few-step policy methods [3-4] that should be discussed and compared.
2. The paper doesn't provide clear guidance on when SSCP is expected to outperform alternatives.
3. While Table 9 shows multi-step rollout results, the analysis is limited.
4. In Figure 7, the performance change could be quite large depending on hyperparameters chosen.

Taxonomy

Topics: Recommender Systems and Techniques · Data Stream Mining Techniques · Privacy-Preserving Technologies in Data

Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings