STEP: Warm-Started Visuomotor Policies with Spatiotemporal Consistency Prediction

Jinhao Li; Yuxuan Cong; Yingqiao Wang; Hao Xia; Shan Huang; Yijia Zhang; Ningyi Xu; Guohao Dai

arXiv:2602.08245 · cs.RO · May 5, 2026

TL;DR

This paper introduces STEP, a novel spatiotemporal consistency prediction method that accelerates diffusion-based visuomotor policies, achieving higher success rates with lower latency in robotic manipulation tasks.

Contribution

STEP provides a lightweight warm-start mechanism and velocity-aware perturbation to improve diffusion policy efficiency without sacrificing quality.

Findings

1. With only 2 denoising steps, STEP outperforms BRIDGER and DDIM in success rate.

2. Achieves 21.6% and 27.5% higher success rates on benchmarks and real-world tasks, respectively.

3. Demonstrates an improved latency-success trade-off over existing methods.

Abstract

Diffusion policies have recently emerged as a powerful paradigm for visuomotor control in robotic manipulation due to their ability to model the distribution of action sequences and capture multimodality. However, iterative denoising leads to substantial inference latency, limiting control frequency in real-time closed-loop systems. Existing acceleration methods either reduce sampling steps, bypass diffusion through direct prediction, or reuse past actions, but often struggle to jointly preserve action quality and achieve consistently low latency. In this work, we propose STEP, a lightweight spatiotemporal consistency prediction mechanism to construct high-quality warm-start actions that are both distributionally close to the target action and temporally consistent, without compromising the generative capability of the original diffusion policy. Then, we propose a velocity-aware…
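
The warm-start idea in the abstract can be illustrated with a toy sketch. This is not the STEP algorithm or the authors' code: the `denoise_step` function is a hypothetical stand-in that simply moves a sample a fixed fraction toward a known target, and all names and numbers are assumptions made for illustration. It only shows why, under a fixed small step budget, starting from an initial action already close to the target can leave less residual error than starting from noise.

```python
# Toy illustration only: NOT the STEP algorithm or any real diffusion policy.
# Every name and number here is hypothetical.

def denoise_step(action, target, rate=0.5):
    """Stand-in for one denoising step: move a fixed fraction toward the target."""
    return [a + rate * (t - a) for a, t in zip(action, target)]


def sample(init, target, num_steps):
    """Run num_steps toy denoising steps starting from init."""
    action = list(init)
    for _ in range(num_steps):
        action = denoise_step(action, target)
    return action


def l2_error(action, target):
    """Euclidean distance between an action and the target action."""
    return sum((a - t) ** 2 for a, t in zip(action, target)) ** 0.5


target = [0.30, -0.10, 0.55]       # hypothetical target action
cold_init = [0.94, -1.40, -0.68]   # fixed stand-in for a Gaussian noise draw
warm_init = [0.28, -0.12, 0.50]    # hypothetical warm start close to the target

# Same step budget for both starts; only the initialization differs.
cold_err = l2_error(sample(cold_init, target, num_steps=2), target)
warm_err = l2_error(sample(warm_init, target, num_steps=2), target)
print(f"cold start, 2 steps: error = {cold_err:.3f}")
print(f"warm start, 2 steps: error = {warm_err:.3f}")
```

In this toy model each step halves the remaining error, so the final error depends entirely on how far the initialization is from the target. A real diffusion policy uses a learned denoiser and a noise schedule, so the sketch conveys the intuition only.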

Code & Models

Repositories

Kimho666/STEP (GitHub)