Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete
Joonha Park, Jiseung Jeong, Taesik Gong

TL;DR
Premover lets vision-language-action (VLA) policies begin acting before the user has finished entering an instruction, using the input delay for precomputation and significantly reducing response time without sacrificing success rate.
Contribution
Introduces Premover, a lightweight module that uses the idle time while a user is still entering a request for precomputation, improving the efficiency of VLA policies.
Findings
Premover reduces mean wall-clock time by 13.6% on the LIBERO benchmark.
Premover maintains a high success rate, comparable to the full-prompt baseline.
Naive premoving sharply degrades performance, underscoring the need for Premover's approach.
Abstract
Vision-Language-Action (VLA) policies are typically evaluated as if the user had finished typing or speaking before the robot begins acting. In real deployment, however, users take several seconds to enter a request, leaving the policy idle for a substantial fraction of the interaction. We introduce Premover, a lightweight module that converts this idle window into useful precomputation. Premover keeps the VLA backbone frozen and attaches two small projection heads, one for image patches, one for language tokens, that map an intermediate layer of the backbone into a shared space. The resulting focus map is supervised by simulator-rendered target-object segmentation masks and applied as a per-patch reweighting of the next step's image tokens. A single scalar readiness threshold, trained jointly from streaming prefixes, decides when the policy should begin acting. On the LIBERO benchmark…
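The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the shared-space similarities become a focus map or how the readiness score is computed, so the choices here (cosine similarity, a max over language tokens, a sigmoid, and peak focus as the readiness score) are assumptions, as are all function and parameter names.

```python
import math

def project(rows, weight):
    """Multiply each feature row (length d_in) by a d_in x d_out weight matrix."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
            for row in rows]

def cosine(u, v):
    """Cosine similarity; a zero vector is treated as having norm 1."""
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def focus_map(patch_feats, token_feats, w_img, w_txt):
    """Score each image patch against the language tokens seen so far."""
    patches = project(patch_feats, w_img)   # image head -> shared space
    tokens = project(token_feats, w_txt)    # language head -> shared space
    # Assumption: best cosine match over tokens, squashed to (0, 1) by a sigmoid.
    return [1.0 / (1.0 + math.exp(-max(cosine(p, t) for t in tokens)))
            for p in patches]

def reweight(image_tokens, focus):
    """Per-patch reweighting: scale each image token by its focus value."""
    return [[f * x for x in tok] for tok, f in zip(image_tokens, focus)]

def is_ready(focus, threshold):
    """Assumption: the readiness score is the peak focus value."""
    return max(focus) >= threshold
```

In a streaming setting, `focus_map` would be recomputed as each new prefix of the instruction arrives, and the policy would start acting the first time `is_ready` returns true. The backbone itself stays frozen; only the two projection matrices and the threshold would be learned.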
