Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement
Weikang Qiu, Tinglin Huang, Rex Ying

TL;DR
This paper introduces SD-VLA, a framework that improves long-horizon vision-language-action models by disentangling static and dynamic visual information, leading to significant efficiency gains and better long-term task performance.
Contribution
The paper proposes a novel static-dynamic disentanglement method for VLAs, reducing context size and inference complexity, and introduces a new benchmark for long-horizon temporal dependency evaluation.
Findings
Outperforms baselines on the new benchmark, improving success rate by 39.8%.
Achieves 3.9% higher success rate on the SimplerEnv benchmark.
Delivers a 2.26x inference speedup over the base VLA model.
Abstract
Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for generalist robotic control. Built upon vision-language model (VLM) architectures, VLAs predict actions conditioned on visual observations and language instructions, achieving strong performance and generalization across tasks. However, VLAs face two major challenges: limited long-horizon context and inefficient inference due to the quadratic attention complexity and large parameter counts. Our work is motivated by the observation that much of the visual information in a trajectory remains static across timesteps (e.g., the background). Leveraging this property, we propose SD-VLA, a framework that disentangles visual inputs into multi-level static and dynamic tokens, which enables (1) retaining a single copy of static tokens across frames to significantly reduce context length, and (2) reusing the…
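To illustrate point (1) of the abstract, the sketch below compares the visual context length when every frame carries its own static tokens against keeping a single shared copy. The function name and the token counts (192 static and 64 dynamic tokens per frame, 8 frames) are hypothetical values chosen for illustration, not figures from the paper, and the quadratic cost ratio is only a rough proxy for self-attention cost.

```python
def context_length(num_frames: int, static_tokens: int,
                   dynamic_tokens: int, share_static: bool) -> int:
    """Count visual tokens in the context for a trajectory.

    share_static=False: every frame contributes static + dynamic tokens.
    share_static=True:  one copy of the static tokens is kept, and each
                        frame contributes only its dynamic tokens.
    """
    if share_static:
        return static_tokens + num_frames * dynamic_tokens
    return num_frames * (static_tokens + dynamic_tokens)


# Hypothetical numbers, for illustration only.
NUM_FRAMES = 8
STATIC_TOKENS = 192
DYNAMIC_TOKENS = 64

baseline = context_length(NUM_FRAMES, STATIC_TOKENS, DYNAMIC_TOKENS, share_static=False)
shared = context_length(NUM_FRAMES, STATIC_TOKENS, DYNAMIC_TOKENS, share_static=True)

# Self-attention cost grows roughly with the square of the context length,
# so the reduction in attention cost is about the squared length ratio.
attention_cost_ratio = (baseline / shared) ** 2

print(f"baseline context: {baseline} tokens")
print(f"shared-static context: {shared} tokens")
print(f"approx. attention cost ratio: {attention_cost_ratio:.2f}x")
```

Under these assumed numbers the shared-static context grows by only the dynamic tokens per added frame, so the gap to the baseline widens as the horizon gets longer.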
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
