Draft-and-Target Sampling for Video Generation Policy

Qikang Zhang; Yingjie Lei; Wei Liu; Daochang Liu

arXiv:2603.13438·cs.CV·March 17, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Draft-and-Target Sampling, a training-free diffusion inference method that significantly speeds up video generation policies by using a self-play denoising approach, token chunking, and progressive acceptance strategies.

Contribution

It proposes a novel, training-free diffusion inference paradigm for efficient video generation, combining draft and target sampling with new speedup techniques.

Findings

01

Achieves up to 2.1x speedup in video generation

02

Improves efficiency of state-of-the-art methods with minimal success rate loss

03

Demonstrates effectiveness on three benchmark datasets

Abstract

Video generation models have been used as robot policies that predict the future states of executing a task, conditioned on a task description and an observation. Previous works overlook their high computational cost and long inference time. To address this challenge, we propose Draft-and-Target Sampling, a novel diffusion inference paradigm for video generation policy that is training-free and improves inference efficiency. We introduce a self-play denoising approach that uses two complementary denoising trajectories in a single model: draft sampling takes large steps to quickly generate a global trajectory, and target sampling takes small steps to verify it. To further speed up generation, we introduce token chunking and a progressive acceptance strategy to reduce redundant computation. Experiments on three benchmarks show that our method can achieve up to 2.1x speedup and improve…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

* The paper proposes system level optimization, e.g., token chunking, to improve the efficiency, which could offer practical values to the community.
* Results seem to provide empirical performance and efficiency gains.

Weaknesses

* The core idea is the speculative decoding, which has been widely adopted in the field in LLMs. The modifications of speculative decoding in the discrete space from this paper include using large-stepsize-ODE as draft model and progressive acceptance, which are rather straightforward implementations. Therefore the contribution of the paper should be more explicitly discussed compared to prior works including but not limited to LLMs.
* Given that the paper uses the same model as draft and targe

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper touches on an overlooked but very important problem with the current paradigm for video generation. It is generally too slow to be useful for robotics policies and planning algorithms. This paper is a promising step in the right direction.
2. The approach is simple and easy to implement.
3. Quantitative performance relative to baselines is promising.

Weaknesses

1. The technical novelty is somewhat limited. No new models or approaches seem to be proposed in this paper. It appears that the contribution of this paper is largely an application of an existing idea to the realm of robotic control.
2. While performance is promising, the speedup is generally modest (about 2x). It is not clear if this speedup outweighs the additional complexity of the approach.
3. Data domain of video generation is quite constrained (robotic environments), there are no expe

Reviewer 03 · Rating 2 · Confidence 3

Strengths

1. The paper presents a novel perspective on video generation policies, focusing on improving inference efficiency. To the best of my knowledge, the proposed strategy of combining large and small diffusion steps for speculative decoding has not been explored before.
2. The experimental evaluation is comprehensive, covering three distinct domains, and the results show consistent and satisfactory performance.

Weaknesses

1. **Lack of theoretical analysis of acceleration**: While the empirical study is extensive, the paper lacks theoretical discussion or quantitative analysis of the acceleration achieved:
   - The assumption in Line 50 that video generation policies usually have low resolutions is questionable. Recent works such as Vidar [1] demonstrate a clear trend toward higher resolutions in this paradigm. The paper does not analyze how the proposed method is applicable under different resolutions (e.g.,

Taxonomy

Topics: Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning