FUTURE-VLA: Forecasting Unified Trajectories Under Real-time Execution
Jingjing Fan, Yushan Liu, Shoujie Li, Botao Ren, Siyuan Li, Xiao-Ping Zhang, Wenbo Ding, Zhidong Deng

TL;DR
FUTURE-VLA is a unified architecture for real-time long-horizon robot control and future forecasting. It reports state-of-the-art success rates at latency comparable to single-frame models by combining temporally adaptive compression with latent-space autoregression.
Contribution
The paper reformulates long-horizon control and future forecasting as a single sequence-generation task, enabling efficient, real-time spatiotemporal reasoning for robotics applications.
Findings
Achieves 99.2% success on LIBERO
Attains 75.4% success on RoboTwin
Maintains latency comparable to single-frame models
Abstract
General vision-language models increasingly support unified spatiotemporal reasoning over long video streams, yet deploying such capabilities on robots remains constrained by the prohibitive latency of processing long-horizon histories and generating high-dimensional future predictions. To bridge this gap, we present FUTURE-VLA, a unified architecture that reformulates long-horizon control and future forecasting as a monolithic sequence-generation task. Adopting a dual-sided efficiency paradigm, FUTURE-VLA leverages a temporally adaptive compression strategy to maximize spatiotemporal information density, enabling the ingestion of extensive multi-view histories while maintaining constant inference latency. Simultaneously, it performs latent-space autoregression to align actionable dynamics with reviewable visual look-aheads in a single forward pass. These real-time predictive…
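The abstract does not say how the temporally adaptive compression is implemented. To illustrate why such a scheme can keep latency constant as the history grows, here is a minimal sketch under assumptions that are ours, not the paper's: a fixed total token budget is split across frames, and each step back in time halves a frame's share. The function name `allocate_token_budget` and its parameters are hypothetical.

```python
def allocate_token_budget(num_frames, total_budget, min_tokens=1):
    """Split a fixed token budget across a frame history (oldest first).

    Hypothetical illustration only: the halving-per-step weighting is an
    assumption, not the compression strategy described in the paper.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if total_budget < num_frames * min_tokens:
        raise ValueError("budget too small for the requested history length")

    # Index 0 is the oldest frame; each newer frame gets twice the weight.
    weights = [2 ** i for i in range(num_frames)]
    total_weight = sum(weights)

    # Every frame keeps a floor of min_tokens; the rest is shared by weight.
    spare = total_budget - num_frames * min_tokens
    alloc = [min_tokens + spare * w // total_weight for w in weights]

    # Integer division can leave a few tokens unassigned; the newest frame gets them.
    alloc[-1] += total_budget - sum(alloc)
    return alloc


if __name__ == "__main__":
    for history in (1, 4, 8):
        tokens = allocate_token_budget(history, total_budget=64)
        print(history, tokens, sum(tokens))
```

Whatever the history length, the allocations sum to the same budget (64 in this example), so the sequence length fed to the model, and with it the inference cost, does not grow with the number of frames. Older frames are simply represented more coarsely.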
Taxonomy
Topics: Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety · Generative Adversarial Networks and Image Synthesis
