Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

Yulin Luo; Hao Chen; Zhuangzhe Wu; Bowen Sui; Jiaming Liu; Chenyang Gu; Zhuoyang Liu; Qiuxuan Feng; Jiale Yu; Shuo Gu; Peng Jia; Pheng-Ann Heng; Shanghang Zhang

arXiv:2603.15618 · cs.CV · March 18, 2026

Open Access

TL;DR

This paper introduces DeepVision-VLA, a framework that enhances visual representations in vision-language-action models for robotic manipulation by integrating multi-level visual features and pruning irrelevant tokens, leading to improved performance.

Contribution

The paper proposes a Vision-Language Mixture-of-Transformers (VL-MoT) framework that uses shared attention and visual token pruning to improve visual grounding in VLA models.

Findings

01. DeepVision-VLA outperforms previous methods by 9.0% on simulated tasks and 7.5% on real-world tasks.

02. During action generation, sensitivity to visual tokens progressively decreases in deeper layers.

03. Shared attention and visual pruning improve manipulation accuracy.

Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded into action generation. Therefore, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. This framework enables shared…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis