PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention
Ziwen Li, Xin Wang, Hanlue Zhang, Runnan Chen, Runqi Lin, Xiao He, Han Huang, Yandong Guo, Fakhri Karray, Tongliang Liu, Mingming Gong

TL;DR
PosA-VLA introduces pose-conditioned anchor attention to improve the precision and efficiency of action generation in vision-language-action models, especially in complex environments, without auxiliary perception modules.
Contribution
It proposes a novel pose-conditioned attention mechanism that enhances focus on task-relevant regions, leading to more accurate and stable actions in embodied tasks.
Findings
Improves action precision and efficiency in robotic manipulation tasks.
Achieves robust generalization across diverse and challenging environments.
Requires no auxiliary perception modules, enabling lightweight inference.
Abstract
The Vision-Language-Action (VLA) models have demonstrated remarkable performance on embodied tasks and shown promising potential for real-world applications. However, current VLAs still struggle to produce consistent and precise target-oriented actions, as they often generate redundant or unstable motions along trajectories, limiting their applicability in time-sensitive scenarios.In this work, we attribute these redundant actions to the spatially uniform perception field of existing VLAs, which causes them to be distracted by target-irrelevant objects, especially in complex environments.To address this issue, we propose an efficient PosA-VLA framework that anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. The pose-conditioned anchor attention mechanism enables the model to better align instruction…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
