AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment

Weijie Kong; Zhian Su; Wei Yu; Huixu Dong

arXiv:2605.17517·cs.RO·May 19, 2026

TL;DR

AffordVLA enhances vision-language-action models by implicitly integrating manipulation-centric affordance representations, leading to improved robustness and accuracy in robotic manipulation tasks without additional perception modules.

Contribution

The paper introduces AffordVLA, a novel framework that internalizes affordance perception into VLA models via implicit feature alignment, eliminating the need for explicit masks or external modules.

Findings

01. Achieves state-of-the-art performance in simulation and real-world tasks.

02. Improves manipulation success rates and training efficiency.

03. Effectively reshapes visual representations while maintaining inference speed.

Abstract

Recent advances in Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation. However, the visual representations of most VLA models are dominated by global object appearance and struggle to focus on task-relevant functional interaction regions, which limits their robustness in unstructured environments. Existing affordance-based methods typically rely on explicit mask injection or external perception modules, which require additional annotations and introduce cascading perception errors and inference overhead. To address these limitations, we propose AffordVLA, an affordance-enhanced VLA framework that internalizes manipulation-centric affordance perception into VLA visual representations through implicit representation alignment. Specifically, we construct a zero-shot affordance teacher to extract task-conditioned affordance visual…
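
The abstract is cut off before the alignment objective is stated, so the following is only a minimal sketch of what "implicit representation alignment" between a teacher and a VLA visual encoder could look like. It assumes a cosine-similarity loss between linearly projected student patch features and frozen teacher features; the function names, the projection matrix, the loss form, and the weighting are all hypothetical and are not taken from the paper.

```python
import numpy as np

def alignment_loss(student_feats, teacher_feats, proj):
    """Mean (1 - cosine similarity) over patches.

    Hypothetical loss form: the paper's actual objective is not given
    in the excerpt above.
    student_feats: (N, D_s) patch features from the VLA visual encoder
    teacher_feats: (N, D_t) patch features from the affordance teacher
    proj:          (D_s, D_t) linear projection (assumed, learned in training)
    """
    projected = student_feats @ proj  # map student features into teacher space
    s = projected / (np.linalg.norm(projected, axis=-1, keepdims=True) + 1e-8)
    t = teacher_feats / (np.linalg.norm(teacher_feats, axis=-1, keepdims=True) + 1e-8)
    cos = np.sum(s * t, axis=-1)      # per-patch cosine similarity
    return float(np.mean(1.0 - cos))

def total_loss(action_loss, align_loss, weight=0.5):
    """Action objective plus weighted alignment term (weight is an assumption)."""
    return action_loss + weight * align_loss

# Toy data standing in for real encoder outputs.
rng = np.random.default_rng(0)
student = rng.normal(size=(16, 32))
teacher = rng.normal(size=(16, 8))
proj = rng.normal(size=(32, 8))

loss_random = alignment_loss(student, teacher, proj)          # unrelated features
loss_matched = alignment_loss(student, student @ proj, proj)  # perfectly aligned
```

In this sketch the teacher features enter only through a training-time loss term, so no mask or extra module would be needed at inference, which is consistent with the claim that inference speed is maintained.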
