AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment
Weijie Kong, Zhian Su, Wei Yu, Huixu Dong

TL;DR
AffordVLA enhances vision-language-action models by implicitly integrating manipulation-centric affordance representations, leading to improved robustness and accuracy in robotic manipulation tasks without additional perception modules.
Contribution
The paper introduces AffordVLA, a novel framework that internalizes affordance perception into VLA models via implicit feature alignment, eliminating the need for explicit masks or external modules.
Findings
Achieves state-of-the-art performance in simulation and real-world tasks.
Improves manipulation success rates and training efficiency.
Effectively reshapes visual representations while maintaining inference speed.
Abstract
Recent advances in Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation. However, the visual representations of most VLA models are often dominated by global object appearance and struggle to focus on task-relevant functional interaction regions, which limits their robustness in unstructured environments. Existing affordance-based methods typically rely on explicit mask injection or external perception modules, requiring additional annotations while introducing cascading perception errors and inference overhead. To address these limitations, we propose AffordVLA, an affordance-enhanced VLA framework that internalizes manipulation-centric affordance perception into VLA visual representations through implicit representation alignment. Specifically, we construct a zero-shot affordance teacher to extract task-conditioned affordance visual…
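The abstract describes aligning VLA visual representations with features from an affordance teacher, but the truncated text does not give the loss. As a rough illustration only, the sketch below assumes a cosine-similarity alignment term between projected student tokens and teacher tokens, added to the action loss with a fixed weight. The names `alignment_loss`, `total_loss`, `proj`, and `weight` are hypothetical and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def alignment_loss(student_tokens, teacher_tokens, proj):
    """Mean (1 - cosine similarity) over visual tokens.

    ASSUMPTION: cosine alignment is a stand-in; the paper's actual
    objective is not stated in the summary above.

    student_tokens: (N, Ds) visual tokens from the VLA backbone
    teacher_tokens: (N, Dt) affordance features from the frozen teacher
    proj:           (Ds, Dt) hypothetical linear projection into teacher space
    """
    projected = student_tokens @ proj
    cos = np.sum(l2_normalize(projected) * l2_normalize(teacher_tokens), axis=-1)
    return float(np.mean(1.0 - cos))


def total_loss(action_loss, align_loss, weight=0.5):
    """Combine the policy's action loss with the auxiliary alignment term.

    ASSUMPTION: a simple weighted sum; `weight` is an illustrative value.
    """
    return action_loss + weight * align_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    student = rng.normal(size=(16, 8))   # 16 tokens, 8-dim student features
    teacher = rng.normal(size=(16, 4))   # 16 tokens, 4-dim teacher features
    proj = rng.normal(size=(8, 4))
    aux = alignment_loss(student, teacher, proj)
    print(total_loss(action_loss=1.0, align_loss=aux))
```

In this sketch the teacher features enter only through the training loss, so nothing extra would run at inference time, which is consistent with the claim that inference speed is maintained.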
