Object Affordance Recognition and Grounding via Multi-scale Cross-modal Representation Learning

Xinhang Wan; Dongqiang Gou; Xinwang Liu; En Zhu; Xuming He

arXiv:2508.01184·cs.CV·April 2, 2026

TL;DR

This paper introduces a multi-scale cross-modal learning approach for 3D object affordance recognition and grounding, improving the localization and understanding of object functionalities in embodied AI.

Contribution

The paper proposes a novel cross-modal 3D representation and a stage-wise inference strategy that together improve both affordance grounding and affordance classification.

Findings

01. Enhanced accuracy in affordance grounding and classification tasks.

02. Ability to predict full potential affordance areas at appropriate scales.

03. Effective coupling of grounding and classification improves overall affordance understanding.

Abstract

A core problem of Embodied AI is to learn object manipulation from observation, as humans do. To achieve this, it is important to localize 3D object affordance areas from observations such as images (3D affordance grounding) and to understand their functionalities (affordance classification). Previous attempts usually tackle these two tasks separately, which leads to inconsistent predictions because the dependency between the tasks is not properly modeled. In addition, these methods typically ground only the incomplete affordance areas depicted in images and fail to predict the full potential affordance areas. They also operate at a fixed scale, which makes it difficult to cope with affordances whose scale varies significantly relative to the whole object. To address these issues, we propose a novel approach that learns an affordance-aware 3D representation and employs a stage-wise inference strategy…
