ETA-VLA: Efficient Token Adaptation via Temporal Fusion and Intra-LLM Sparsification for Vision-Language-Action Models

Yiru Wang; Anqing Jiang; Shuo Wang; Yuwen Heng; Zichong Gu; Hao Sun

arXiv:2603.25766·cs.RO·March 30, 2026

TL;DR

ETA-VLA is an efficient token adaptation framework for vision-language-action models in autonomous driving. It dynamically prunes redundant visual tokens inside the LLM; reported results include a 32% reduction in FLOPs and 94% of the original accuracy on NAVSIM v2.

Contribution

The paper presents ETA-VLA, a framework that processes the past $n$ frames of multi-view images and uses an Intra-LLM Sparse Aggregator (ILSA) to dynamically prune redundant visual tokens, guided by textual queries and temporal consistency.

Findings

01

Reduces FLOPs by 32% while maintaining performance.

02

Prunes 85% of visual tokens effectively.

03

Achieves 94% of original accuracy on NAVSIM v2.

Abstract

The integration of Vision-Language-Action (VLA) models into autonomous driving systems offers a unified framework for interpreting complex scenes and executing control commands. However, the necessity to incorporate historical multi-view frames for accurate temporal reasoning imposes a severe computational burden, primarily driven by the quadratic complexity of self-attention mechanisms in Large Language Models (LLMs). To alleviate this bottleneck, we propose ETA-VLA, an Efficient Token Adaptation framework for VLA models. ETA-VLA processes the past $n$ frames of multi-view images and introduces a novel Intra-LLM Sparse Aggregator (ILSA). Drawing inspiration from human driver attention allocation, ILSA dynamically identifies and prunes redundant visual tokens guided by textual queries and temporal consistency. Specifically, we utilize a text-guided scoring mechanism alongside a…
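The abstract is truncated before it specifies the scoring mechanism, so the following is only a generic sketch of text-guided visual token pruning, not the authors' ILSA. It assumes each visual token is scored by its mean scaled dot-product similarity to the text query tokens and that the top-scoring fraction is kept in original order. The function names, the `keep_ratio` parameter, and the scoring rule are all hypothetical; the default of 0.15 simply mirrors the 85% pruning figure listed under Findings. The temporal-consistency signal the abstract mentions is not modeled here.

```python
import math


def text_guided_scores(visual_tokens, text_tokens):
    """Score each visual token by its mean scaled dot product with the text tokens.

    Assumption: this similarity rule is illustrative, not the paper's mechanism.
    """
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    scores = []
    for v in visual_tokens:
        sims = [sum(a * b for a, b in zip(v, t)) * scale for t in text_tokens]
        scores.append(sum(sims) / len(sims))
    return scores


def prune_visual_tokens(visual_tokens, text_tokens, keep_ratio=0.15):
    """Keep the highest-scoring fraction of visual tokens, preserving their order.

    keep_ratio is a hypothetical parameter; 0.15 corresponds to pruning 85%.
    Returns (kept_indices, kept_tokens).
    """
    scores = text_guided_scores(visual_tokens, text_tokens)
    num_keep = max(1, round(keep_ratio * len(visual_tokens)))
    # Rank indices by descending score, then restore positional order.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:num_keep]
    kept = sorted(top)
    return kept, [visual_tokens[i] for i in kept]
```

Under these assumptions, calling `prune_visual_tokens` on 20 visual tokens with the default `keep_ratio` keeps 3 of them. Restoring positional order after ranking is a design choice made so that any positional information attached to the surviving tokens stays consistent for later layers.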
