TL;DR
This paper introduces STEP, a novel spatiotemporal consistency prediction method that accelerates diffusion-based visuomotor policies, achieving higher success rates with lower latency in robotic manipulation tasks.
Contribution
STEP provides a lightweight warm-start mechanism and velocity-aware perturbation to improve diffusion policy efficiency without sacrificing quality.
Findings
STEP with 2 steps outperforms BRIDGER and DDIM in success rate.
STEP achieves 21.6% and 27.5% higher success rates on benchmarks and real tasks, respectively.
STEP demonstrates an improved latency-success trade-off over existing methods.
Abstract
Diffusion policies have recently emerged as a powerful paradigm for visuomotor control in robotic manipulation due to their ability to model the distribution of action sequences and capture multimodality. However, iterative denoising leads to substantial inference latency, limiting control frequency in real-time closed-loop systems. Existing acceleration methods either reduce sampling steps, bypass diffusion through direct prediction, or reuse past actions, but often struggle to jointly preserve action quality and achieve consistently low latency. In this work, we propose STEP, a lightweight spatiotemporal consistency prediction mechanism to construct high-quality warm-start actions that are both distributionally close to the target action and temporally consistent, without compromising the generative capability of the original diffusion policy. Then, we propose a velocity-aware…
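To make the warm-start idea concrete, here is a minimal sketch of generic warm-started few-step DDIM sampling. It is not STEP's implementation, which the abstract does not specify. The function names (`warm_start_sample`, `ddim_step`, `toy_eps_model`), the linear noise schedule, and the starting timestep `start_t` are all assumptions made for illustration. The noise predictor is a toy oracle that already knows the target action, so the example only demonstrates the sampling mechanics: start from a perturbed initial guess at an intermediate noise level instead of pure Gaussian noise, then run a small number of deterministic denoising steps.

```python
import numpy as np

def make_alpha_bars(num_train_steps=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention terms for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)

def ddim_step(x_t, eps, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0)."""
    x0_hat = (x_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0_hat + np.sqrt(1.0 - abar_prev) * eps

def warm_start_sample(eps_model, warm_action, alpha_bars, start_t, num_steps, rng):
    """Denoise from a perturbed warm-start action instead of pure noise.

    warm_action: initial guess for the action sequence (hypothetical input,
                 e.g. produced by some lightweight predictor).
    start_t:     intermediate timestep at which sampling begins (assumption).
    """
    # Forward-diffuse the warm-start action to noise level start_t.
    abar_start = alpha_bars[start_t]
    noise = rng.standard_normal(warm_action.shape)
    x = np.sqrt(abar_start) * warm_action + np.sqrt(1.0 - abar_start) * noise

    # Evenly spaced timesteps from start_t down to 0; -1 marks the clean sample.
    timesteps = np.linspace(start_t, 0, num_steps).round().astype(int)
    prev_timesteps = list(timesteps[1:]) + [-1]
    for t, t_prev in zip(timesteps, prev_timesteps):
        eps = eps_model(x, t)
        abar_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0
        x = ddim_step(x, eps, alpha_bars[t], abar_prev)
    return x

# Toy setup: an 8-step sequence of 2-D actions.
rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()
target = np.stack([np.linspace(0.0, 1.0, 8), np.linspace(1.0, 0.0, 8)], axis=1)

def toy_eps_model(x, t):
    # Oracle noise predictor that knows the target; stands in for a trained network.
    return (x - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

# A slightly wrong initial guess, refined with only 2 denoising steps.
warm_action = target + 0.05 * rng.standard_normal(target.shape)
action = warm_start_sample(toy_eps_model, warm_action, alpha_bars,
                           start_t=30, num_steps=2, rng=rng)
```

Because the oracle predicts the noise exactly, the refined `action` recovers `target` regardless of the warm start. With a learned noise predictor, the quality of the initial guess and the choice of `start_t` would determine how few steps suffice.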
