WorldAfford: Affordance Grounding based on Natural Language Instructions
Changmao Chen, Yuren Cong, Zhen Kan

TL;DR
This paper introduces WorldAfford, a framework for affordance grounding from complex natural language instructions. It localizes the affordance regions of multiple objects in complex scenes, going beyond earlier methods that accept only simple action labels as input.
Contribution
The paper presents a new task, dataset, and framework that incorporate natural language instructions and reasoning for affordance grounding, addressing limitations of prior object-centric approaches.
Findings
WorldAfford achieves state-of-the-art results on the AGD20K and LLMaFF datasets.
It can localize affordance regions of multiple objects in complex scenes.
The framework offers an alternative when the objects in the environment do not fully match those named in the instruction.
Abstract
Affordance grounding aims to localize the interaction regions for the manipulated objects in the scene image according to given instructions. A critical challenge in affordance grounding is that the embodied agent should understand human instructions and analyze which tools in the environment can be used, as well as how to use these tools to accomplish the instructions. Most recent works primarily support simple action labels as input instructions for localizing affordance regions, failing to capture complex human objectives. Moreover, these approaches typically identify affordance regions of only a single object in object-centric images, ignoring the object context and struggling to localize affordance regions of multiple objects in complex scenes for practical applications. To address this concern, for the first time, we introduce a new task of affordance grounding based on natural…
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training · Segment Anything Model
