2D or 3D: Who Governs Salience in VLA Models? -- Tri-Stage Token Pruning Framework with Modality Salience Awareness

Zihao Zheng; Sicheng Tian; Zhihao Mao; Lingyue Zhang; Chenyue Li; Ziyun Zhang; Hong Gao; Yuchen Huang; Yutong Xu; Guojie Luo; Xiang Chen

arXiv:2604.09244·cs.MM·April 21, 2026

TL;DR

This paper introduces a tri-stage token pruning framework for multi-visual-modal VLA models that efficiently balances 2D and 3D modality salience, boosting inference speed with minimal accuracy loss.

Contribution

It proposes a novel tri-stage analysis and token pruning framework specifically designed for 2D/3D VLA models, addressing modality salience differences.

Findings

01

Achieves up to 2.55x inference speedup

02

Maintains accuracy with minimal loss

03

Incurs only 5.8% overhead

Abstract

Vision-Language-Action (VLA) models have emerged as the mainstream of embodied intelligence. Recent VLA models have expanded their input modalities from 2D-only to 2D+3D paradigms, forming multi-visual-modal VLA (MVLA) models. Despite achieving improved spatial perception, MVLA faces a greater acceleration demand due to the increased number of input tokens caused by modal expansion. Token pruning is an effective optimization method for MVLA models. However, existing token pruning schemes are designed for 2D-only VLA models, ignoring 2D/3D modality salience differences. In this paper, we follow the application process of multi-modal data in MVLA models and develop a tri-stage analysis to capture the discrepancy and dynamics of 2D/3D modality salience. Based on these, we propose a corresponding tri-stage token pruning framework for MVLA models to achieve optimal 2D/3D token…
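
The abstract does not spell out the pruning rule, so the following is only a generic illustration of modality-aware token pruning, not the paper's tri-stage method. It assumes each 2D and 3D token already has a scalar salience score, and that a fixed token budget is split between the two modalities by a hypothetical `share_2d` ratio; the function names, the scores, and the split rule are all assumptions made for this sketch.

```python
from typing import List, Sequence, Tuple


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, in original token order."""
    k = max(0, min(k, len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def prune_by_modality(
    scores_2d: Sequence[float],
    scores_3d: Sequence[float],
    budget: int,
    share_2d: float,
) -> Tuple[List[int], List[int]]:
    """Keep at most `budget` tokens, split between modalities by `share_2d`.

    Hypothetical rule: the 2D modality receives round(budget * share_2d)
    slots and the 3D modality receives the remainder. Within each modality
    the highest-salience tokens are kept.
    """
    keep_2d = min(len(scores_2d), round(budget * share_2d))
    keep_3d = min(len(scores_3d), budget - keep_2d)
    return top_k_indices(scores_2d, keep_2d), top_k_indices(scores_3d, keep_3d)


# Made-up salience scores for four 2D tokens and four 3D tokens.
kept_2d, kept_3d = prune_by_modality(
    scores_2d=[0.9, 0.1, 0.4, 0.7],
    scores_3d=[0.2, 0.8, 0.3, 0.6],
    budget=4,
    share_2d=0.5,
)
print(kept_2d, kept_3d)
```

A scheme that ignores modality would rank all eight tokens in one pool; giving each modality its own budget is one simple way to keep a salient modality from being crowded out, which is the kind of 2D/3D imbalance the abstract says existing 2D-only schemes overlook.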
