FocusVLA: Focused Visual Utilization for Vision-Language-Action Models
Yichi Zhang, Weihao Yuan, Yizhuo Zhang, Xidong Zhang, Jia Wan

TL;DR
FocusVLA enhances vision-language-action models by directing attention to task-relevant visual regions, improving task performance and speeding up convergence on robotic manipulation tasks.
Contribution
The paper proposes FocusVLA, a new paradigm with Modality Cascaded Attention and Focus Attention mechanisms to better utilize visual details for action generation.
Findings
FocusVLA improves task performance in robotic benchmarks.
It accelerates convergence across various tasks.
It effectively suppresses task-irrelevant visual noise.
Abstract
Vision-Language-Action (VLA) models improve action generation by conditioning policies on rich vision-language information. However, current auto-regressive policies are constrained by three bottlenecks: (1) architectural bias drives models to overlook visual details, (2) an excessive number of visual tokens makes it difficult for attention to focus on the correct regions, and (3) task-irrelevant visual information introduces substantial noise. Together, these bottlenecks severely impair action quality. In this paper, we investigate how to effectively utilize different visual representations for action generation. To this end, we first empirically validate the above issues and show that VLA performance is primarily limited by how visual information is utilized, rather than by the quality of visual representations. Based on these insights, we introduce FocusVLA, a novel paradigm that directs the…
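Bottleneck (2) in the abstract can be illustrated with a toy calculation. The sketch below is not FocusVLA's Focus Attention or any code from the paper; it is a generic softmax example with hypothetical scores (`relevant_score`, `num_relevant`) showing how the attention mass on a few relevant tokens shrinks as the number of visual tokens grows, and how a plain top-k mask, used here only as a stand-in for "focusing", restores it.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def relevant_mass(num_tokens, num_relevant=4, relevant_score=2.0, topk=None):
    """Return the total attention weight placed on the relevant tokens.

    Assumption (illustrative only): the first `num_relevant` tokens get
    score `relevant_score`; every other token gets score 0.0.
    """
    scores = [relevant_score] * num_relevant
    scores += [0.0] * (num_tokens - num_relevant)
    if topk is not None:
        # Generic top-k masking: keep scores at or above the k-th largest,
        # mask the rest to -inf. Ties at the threshold are all kept.
        threshold = sorted(scores, reverse=True)[topk - 1]
        scores = [s if s >= threshold else float("-inf") for s in scores]
    weights = softmax(scores)
    return sum(weights[:num_relevant])


if __name__ == "__main__":
    for n in (64, 256, 1024):
        print(n, round(relevant_mass(n), 3), round(relevant_mass(n, topk=4), 3))
```

With these made-up scores, the unmasked mass on the relevant tokens falls steadily as the token count rises, while the top-k variant keeps all of the weight on them. How FocusVLA actually selects regions is not specified in the truncated abstract above.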
