GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

Xiaosong Jia; Bowen Yang; Zuhao Ge; Xian Nie; Yuchen Zhou; Cunxin Fan; Yufeng Li; Yilin Chai; Chao Jing; Zijian Liang; Qingwen Bu; Haidong Cao; Chao Wu; Qifeng Li; Zhenjie Yang; Chenhe Zhang; Hongyang Li; Zuxuan Wu; Junchi Yan; Yu-Gang Jiang

arXiv:2605.12369 · cs.RO · May 13, 2026

TL;DR

GuidedVLA introduces a modular approach to vision-language-action models by supervising individual attention heads with auxiliary signals, enhancing focus on task-relevant factors and improving generalization in robot learning.

Contribution

The paper proposes a framework that supervises individual attention heads of the action decoder with manually defined auxiliary signals, so that each head captures a distinct task-relevant factor rather than spurious correlations such as visual shortcuts or environmental noise.

Findings

01. GuidedVLA improves success rates in simulation and real-robot experiments.

02. Specialized attention heads capture distinct task-relevant factors.

03. High-quality, decoupled features correlate with better task performance.

Abstract

Vision-Language-Action (VLA) models aim for general robot learning by aligning action as a modality within powerful Vision-Language Models (VLMs). Existing VLAs rely on end-to-end supervision to implicitly enable the action decoding process to learn task-relevant features. However, without explicit guidance, these models often overfit to spurious correlations, such as visual shortcuts or environmental noise, limiting their generalization. In this paper, we introduce GuidedVLA, a framework designed to manually guide the action generation to focus on task-relevant factors. Our core insight is to treat the action decoder not as a monolithic learner, but as an assembly of functional components. Individual attention heads are supervised by manually defined auxiliary signals to capture distinct factors. As an initial study, we instantiate this paradigm with three specialized heads: object…
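
The idea of supervising a single attention head with an auxiliary signal can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`head_attention`, `guided_head_loss`), the choice of a cross-entropy loss against a binary token mask, the loss weight of 0.1, and the placeholder action loss are all assumptions made purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def head_attention(queries, keys, num_heads):
    """Per-head attention weights of action queries over context tokens.

    queries: (Q, D) action-query features; keys: (T, D) context-token features.
    Returns an array of shape (num_heads, Q, T) whose last axis sums to 1.
    """
    num_q, dim = queries.shape
    num_t = keys.shape[0]
    d = dim // num_heads
    q = queries.reshape(num_q, num_heads, d).transpose(1, 0, 2)  # (H, Q, d)
    k = keys.reshape(num_t, num_heads, d).transpose(1, 0, 2)     # (H, T, d)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)               # (H, Q, T)
    return softmax(scores, axis=-1)


def guided_head_loss(attn, head_idx, target_mask, eps=1e-8):
    """Cross-entropy between one head's attention and a normalized target mask.

    Hypothetical auxiliary loss: it is small when the chosen head puts its
    attention mass on the tokens marked in target_mask.
    """
    target = target_mask / (target_mask.sum() + eps)  # (T,)
    p = attn[head_idx]                                # (Q, T)
    return float(-(target * np.log(p + eps)).sum(axis=-1).mean())


# Toy data: 4 action queries, 16 context tokens, 32-dim features, 4 heads.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 32))
keys = rng.normal(size=(16, 32))
attn = head_attention(queries, keys, num_heads=4)

# Assumed auxiliary signal: tokens 3..5 are the "relevant" ones for head 0.
mask = np.zeros(16)
mask[3:6] = 1.0

aux_loss = guided_head_loss(attn, head_idx=0, target_mask=mask)
action_loss = 0.5  # placeholder standing in for the end-to-end action loss
total_loss = action_loss + 0.1 * aux_loss  # 0.1 is an arbitrary weight
```

In this sketch only head 0 receives the auxiliary term; the remaining heads are shaped by the action loss alone, which mirrors the abstract's framing of the decoder as an assembly of functional components rather than a monolithic learner.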
