Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement

Weikang Qiu; Tinglin Huang; Rex Ying

arXiv:2602.03983 · cs.RO · February 17, 2026

TL;DR

This paper introduces SD-VLA, a framework that improves long-horizon vision-language-action (VLA) models by separating static from dynamic visual information, yielding a 2.26x inference speedup over the base VLA model and higher success rates on long-horizon tasks.

Contribution

The paper proposes a static-dynamic disentanglement method for VLAs that reduces context length and inference cost, and introduces a new benchmark for evaluating long-horizon temporal dependencies.

Findings

01. Outperforms baselines by 39.8% in success rate on the new long-horizon benchmark.
02. Achieves a 3.9% higher success rate on the SimplerEnv benchmark.
03. Delivers a 2.26x inference speedup over the base VLA model.

Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for generalist robotic control. Built upon vision-language model (VLM) architectures, VLAs predict actions conditioned on visual observations and language instructions, achieving strong performance and generalization across tasks. However, VLAs face two major challenges: limited long-horizon context and inefficient inference due to the quadratic attention complexity and large parameter counts. Our work is motivated by the observation that much of the visual information in a trajectory remains static across timesteps (e.g., the background). Leveraging this property, we propose SD-VLA, a framework that disentangles visual inputs into multi-level static and dynamic tokens, which enables (1) retaining a single copy of static tokens across frames to significantly reduce context length, and (2) reusing the…
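
The context-length argument in point (1) can be illustrated with a back-of-the-envelope calculation. The sketch below is not the paper's implementation: the function names, the single flat split into static and dynamic tokens (the paper describes a multi-level split), and all numbers are assumptions chosen for illustration. It counts visual tokens only and ignores language and action tokens.

```python
def context_tokens(frames: int, tokens_per_frame: int, static_fraction: float):
    """Compare visual context lengths for a history of `frames` frames.

    Hypothetical model: a fixed fraction of each frame's tokens is static.
    Returns (naive, shared), where `naive` keeps every token of every frame
    and `shared` keeps one copy of the static tokens plus the dynamic
    tokens of each frame.
    """
    static = round(tokens_per_frame * static_fraction)
    dynamic = tokens_per_frame - static
    naive = frames * tokens_per_frame
    shared = static + frames * dynamic
    return naive, shared


def attention_cost_ratio(naive: int, shared: int) -> float:
    """Ratio of self-attention cost, assuming cost grows with length squared."""
    return (naive / shared) ** 2


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    naive, shared = context_tokens(frames=8, tokens_per_frame=256, static_fraction=0.75)
    print(f"naive context:  {naive} tokens")
    print(f"shared context: {shared} tokens")
    print(f"attention cost ratio: {attention_cost_ratio(naive, shared):.2f}")
```

Under these assumed numbers the shared layout holds 704 visual tokens instead of 2048. The saving grows with both the number of frames and the static fraction, which is why the approach targets long-horizon settings.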

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning