TL;DR
SnapFlow is a self-distillation technique that compresses multi-step flow-matching models into a single-step model, significantly reducing inference latency while maintaining high success rates in robotic manipulation tasks.
Contribution
SnapFlow introduces a self-distillation method that enables one-step action generation for flow-matching VLAs without external teachers or architecture changes.
Findings
Achieves 98.75% success on LIBERO tasks, matching multi-step models.
Reduces end-to-end latency from 274 ms to 83 ms, a 3.3x speedup.
Maintains performance across long-horizon tasks with 93% success at 5 actions.
Abstract
Vision-Language-Action (VLA) models based on flow matching -- such as pi0, pi0.5, and SmolVLA -- achieve state-of-the-art generalist robotic manipulation, yet their iterative denoising, typically 10 ODE steps, introduces substantial latency: on a modern GPU, denoising alone accounts for 80% of end-to-end inference time. Naively reducing the step count is unreliable, degrading success on most tasks due to the velocity field being uncalibrated for single-step jumps. We present SnapFlow, a plug-and-play self-distillation method that compresses multi-step denoising into a single forward pass (1-NFE) for flow-matching VLAs. SnapFlow mixes standard flow-matching samples with consistency samples whose targets are two-step Euler shortcut velocities computed from the model's own marginal velocity predictions, avoiding the trajectory drift caused by conditional velocities, as we analyze…
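The two-step Euler shortcut idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: `toy_velocity` is a hypothetical stand-in for the model's velocity prediction, the function names are invented for illustration, and the time convention (integrating from t=0 to t=1) is an assumption. The sketch only shows why a naive single Euler step can disagree with a multi-step rollout, and how averaging two half-step velocities gives a target that reproduces a two-step rollout in one jump.

```python
import numpy as np

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt)
    return x

def two_step_shortcut_velocity(velocity, x, t, d):
    """Average of the velocities used by two Euler half-steps of size d/2.

    Jumping once by d with this averaged velocity lands exactly where
    two consecutive Euler steps of size d/2 would land.
    """
    v1 = velocity(x, t)
    x_mid = x + 0.5 * d * v1
    v2 = velocity(x_mid, t + 0.5 * d)
    return 0.5 * (v1 + v2)

# Hypothetical toy field standing in for a learned velocity network.
goal = np.array([1.0, -2.0])

def toy_velocity(x, t):
    return (goal - x) * (1.0 + t)

x0 = np.zeros(2)

# A naive one-step sample differs from the 10-step rollout on this field.
naive_one_step = euler_sample(toy_velocity, x0, 1)
ten_step = euler_sample(toy_velocity, x0, 10)

# A single jump with the shortcut velocity matches the two-step rollout.
shortcut_v = two_step_shortcut_velocity(toy_velocity, x0, t=0.0, d=1.0)
one_jump = x0 + 1.0 * shortcut_v
two_step = euler_sample(toy_velocity, x0, 2)

print("naive 1-step:", naive_one_step)
print("10-step     :", ten_step)
print("shortcut    :", one_jump, "vs 2-step:", two_step)
```

In a distillation setting, a quantity like `shortcut_v` would serve as a regression target for a single-step prediction. How SnapFlow weights such consistency samples against standard flow-matching samples is not specified in the truncated abstract, so it is left out here.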
