WorldAfford: Affordance Grounding based on Natural Language Instructions

Changmao Chen; Yuren Cong; Zhen Kan

arXiv:2405.12461·cs.CV·May 22, 2024

Open Access

TL;DR

This paper introduces WorldAfford, a framework for affordance grounding from complex natural language instructions. It localizes the affordance regions of multiple objects in complex scenes, going beyond earlier methods that accept only simple action labels.

Contribution

The paper presents a new task, dataset, and framework that incorporate natural language instructions and reasoning for affordance grounding, addressing limitations of prior object-centric approaches.

Findings

01

WorldAfford achieves state-of-the-art results on AGD20K and LLMaFF datasets.

02

It can localize affordance regions of multiple objects in complex scenes.

03

When the objects in the environment do not fully match those named in the instruction, the framework can offer an alternative.

Abstract

Affordance grounding aims to localize the interaction regions for the manipulated objects in the scene image according to given instructions. A critical challenge in affordance grounding is that the embodied agent should understand human instructions and analyze which tools in the environment can be used, as well as how to use these tools to accomplish the instructions. Most recent works primarily support simple action labels as input instructions for localizing affordance regions, failing to capture complex human objectives. Moreover, these approaches typically identify affordance regions of only a single object in object-centric images, ignoring the object context and struggling to localize affordance regions of multiple objects in complex scenes for practical applications. To address this concern, for the first time, we introduce a new task of affordance grounding based on natural…

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training · Segment Anything Model