AffordanceLLM: Grounding Affordance from Vision Language Models

Shengyi Qian; Weifeng Chen; Min Bai; Xiong Zhou; Zhuowen Tu; Li Erran Li

arXiv:2401.06341 · cs.CV · April 19, 2024 · 1 citation

TL;DR

This paper introduces AffordanceLLM, a model leveraging large-scale vision-language models to improve affordance grounding, demonstrating enhanced generalization and performance on in-the-wild images and unseen objects and actions.

Contribution

It proposes a novel approach that utilizes pretrained vision-language models to better understand object affordances, surpassing existing methods in generalization and accuracy.

Findings

01. Significant performance improvement on the AGD20K benchmark.

02. Effective grounding of affordances in unseen objects and actions.

03. Strong generalization to internet images.

Abstract

Affordance grounding refers to the task of finding the area of an object with which one can interact. It is a fundamental but challenging task, as a successful solution requires a comprehensive understanding of a scene in multiple aspects, including detection, localization, and recognition of objects with their parts, of the geo-spatial configuration/layout of the scene, of 3D shapes and physics, as well as of the functionality and potential interaction of the objects and humans. Much of this knowledge is hidden and lies beyond the image content and the supervised labels of a limited training set. In this paper, we make an attempt to improve the generalization capability of current affordance grounding by taking advantage of the rich world, abstract, and human-object-interaction knowledge from pretrained large-scale vision language models. Under the AGD20K benchmark, our proposed…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Robot Manipulation and Learning