SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model

Chunlin Yu; Hanqing Wang; Ye Shi; Haoyang Luo; Sibei Yang; Jingyi Yu; Jingya Wang

arXiv:2412.01550·cs.CV·March 24, 2025

Open Access

TL;DR

SeqAfford introduces a novel 3D affordance reasoning framework that interprets complex user instructions into sequential segmentation maps, leveraging multimodal large language models and a new benchmark to enhance robotic manipulation capabilities.

Contribution

The paper presents the first instruction-based 3D affordance segmentation benchmark and a multimodal large language model, SeqAfford, for sequential reasoning and segmentation of 3D objects.

Findings

01. SeqAfford outperforms existing methods in 3D affordance segmentation.

02. The model demonstrates strong generalization to open-world scenarios.

03. It effectively decomposes complex instructions into sequential affordance segments.

Abstract

3D affordance segmentation aims to link human instructions to touchable regions of 3D objects for embodied manipulations. Existing efforts typically adhere to single-object, single-affordance paradigms, where each affordance type or explicit instruction strictly corresponds to a specific affordance region, and are unable to handle long-horizon tasks. Such a paradigm cannot actively reason about complex user intentions that often imply sequential affordances. In this paper, we introduce the Sequential 3D Affordance Reasoning task, which extends the traditional paradigm by reasoning from cumbersome user intentions and then decomposing them into a series of segmentation maps. Toward this, we construct the first instruction-based affordance segmentation benchmark that includes reasoning over both single and sequential affordances, comprising 180K instruction-point cloud pairs. Based on the…
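To make the task described in the abstract concrete, the following is a minimal sketch of what one sequential sample could look like as a data structure: one instruction, one point cloud, and an ordered list of per-point segmentation maps. Every name here (`AffordanceStep`, `SequentialAffordanceSample`, `binarize`), the per-point score representation, the 0.5 threshold, and the toy data are assumptions made for illustration. None of them is taken from the paper or its benchmark.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical types for illustration only; not the paper's data format.
Point = Tuple[float, float, float]


@dataclass
class AffordanceStep:
    """One step in the decomposed sequence (assumed structure)."""
    affordance: str      # illustrative label, e.g. "grasp"
    scores: List[float]  # assumed: one score in [0, 1] per point


@dataclass
class SequentialAffordanceSample:
    """An instruction paired with a point cloud and ordered affordance maps."""
    instruction: str
    points: List[Point]
    steps: List[AffordanceStep]

    def validate(self) -> None:
        # Every segmentation map must cover the whole point cloud.
        for i, step in enumerate(self.steps):
            if len(step.scores) != len(self.points):
                raise ValueError(
                    f"step {i}: expected {len(self.points)} scores, "
                    f"got {len(step.scores)}"
                )
            if any(not 0.0 <= s <= 1.0 for s in step.scores):
                raise ValueError(f"step {i}: scores must lie in [0, 1]")


def binarize(step: AffordanceStep, threshold: float = 0.5) -> List[bool]:
    """Turn per-point scores into a binary mask (threshold is an assumption)."""
    return [s >= threshold for s in step.scores]


# Toy example: a four-point "object" and a two-step instruction.
sample = SequentialAffordanceSample(
    instruction="Pick up the kettle and pour some water.",
    points=[(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (1.0, 0.0, 1.0)],
    steps=[
        AffordanceStep("grasp", [0.9, 0.8, 0.1, 0.0]),
        AffordanceStep("pour", [0.0, 0.2, 0.7, 0.95]),
    ],
)
sample.validate()
masks = [binarize(step) for step in sample.steps]
```

The ordered `steps` list is the point of the sketch: a single-affordance setup would return one map per instruction, whereas the sequential task returns a series of maps in which the order carries meaning.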

Taxonomy

Topics: Natural Language Processing Techniques · Human Motion and Animation · Handwritten Text Recognition Techniques