Egocentric Instruction-oriented Affordance Prediction via Large Multimodal Model

Bokai Ji; Jie Gu; Xiaokang Ma; Chu Tang; Jingmin Chen; Guangxia Li

arXiv:2508.17922 · cs.RO · August 26, 2025

TL;DR

This paper introduces a new egocentric dataset and a large multimodal model approach for instruction-dependent affordance prediction, so that a robot can choose where and in which direction to manipulate an object according to the given task.

Contribution

The paper presents a large egocentric dataset of object-instruction-affordance triplets and a "search against verifiers" pipeline that lets large multimodal models predict affordances from instructions.

Findings

01. Achieves instruction-dependent affordance prediction.

02. Demonstrates superior performance over existing methods.

03. Enables reasoning-like iterative affordance prediction.

Abstract

Affordance is crucial for intelligent robots in the context of object manipulation. In this paper, we argue that affordance should be task-/instruction-dependent, a point overlooked by many previous works. That is, different instructions can lead to different manipulation regions and directions even for the same object. Based on this observation, we present a new dataset comprising fifteen thousand object-instruction-affordance triplets. All scenes in the dataset are from an egocentric viewpoint, designed to approximate the perspective of a human-like robot. Furthermore, we investigate how to enable large multimodal models (LMMs) to serve as affordance predictors by implementing a "search against verifiers" pipeline. An LMM is asked to progressively predict affordances, with the output at each step being verified by itself during the iterative process, imitating a reasoning…
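
The iterative predict-then-verify idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Affordance` fields, the `search_against_verifier` function, the `steps` and `accept` parameters, and the toy `toy_propose` and `toy_verify` stand-ins (which replace real LMM calls) are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Affordance:
    """Hypothetical affordance: an image-space contact point and a motion direction."""
    point: Tuple[float, float]
    direction: Tuple[float, float]


# Hypothetical stand-ins for two roles that one LMM would play in practice.
Proposer = Callable[[str, List[Affordance]], List[Affordance]]
Verifier = Callable[[str, Affordance], float]


def search_against_verifier(
    instruction: str,
    propose: Proposer,
    verify: Verifier,
    steps: int = 3,       # assumed iteration budget
    accept: float = 0.9,  # assumed acceptance threshold
) -> Optional[Affordance]:
    """Propose candidates step by step, score each one, keep the best so far."""
    best: Optional[Affordance] = None
    best_score = float("-inf")
    rejected: List[Affordance] = []
    for _ in range(steps):
        candidates = propose(instruction, rejected)
        if not candidates:
            break
        for cand in candidates:
            score = verify(instruction, cand)
            if score > best_score:
                best, best_score = cand, score
            if score < accept:
                rejected.append(cand)  # fed back so the proposer avoids it
        if best_score >= accept:
            break  # the verifier is satisfied, stop searching
    return best


def toy_propose(instruction: str, rejected: List[Affordance]) -> List[Affordance]:
    """Toy proposer: offers one not-yet-rejected candidate per step."""
    pool = [
        Affordance((0.2, 0.8), (0.0, 1.0)),
        Affordance((0.5, 0.5), (1.0, 0.0)),
        Affordance((0.5, 0.5), (-1.0, 0.0)),
    ]
    remaining = [a for a in pool if a not in rejected]
    return remaining[:1]


def toy_verify(instruction: str, cand: Affordance) -> float:
    """Toy verifier: 'open' wants a pull along +x, anything else a push along -x."""
    wanted = (1.0, 0.0) if "open" in instruction else (-1.0, 0.0)
    return 1.0 if cand.direction == wanted else 0.0


opened = search_against_verifier("open the drawer", toy_propose, toy_verify)
closed = search_against_verifier("close the drawer", toy_propose, toy_verify)
```

In this toy setting the same object yields different directions for the two instructions, which mirrors the instruction-dependence the abstract argues for. In a real system both callables would presumably be prompts to the same LMM, and the scoring scale would depend on how the verification prompt is designed.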
