SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model
Chunlin Yu, Hanqing Wang, Ye Shi, Haoyang Luo, Sibei Yang, Jingyi Yu, Jingya Wang

TL;DR
SeqAfford introduces a novel 3D affordance reasoning framework that interprets complex user instructions into sequential segmentation maps, leveraging multimodal large language models and a new benchmark to enhance robotic manipulation capabilities.
Contribution
The paper presents the first instruction-based 3D affordance segmentation benchmark and a multimodal large language model, SeqAfford, for sequential reasoning and segmentation of 3D objects.
Findings
SeqAfford outperforms existing methods in 3D affordance segmentation.
The model demonstrates strong generalization to open-world scenarios.
It effectively decomposes complex instructions into sequential affordance segments.
Abstract
3D affordance segmentation aims to link human instructions to touchable regions of 3D objects for embodied manipulation. Existing efforts typically adhere to single-object, single-affordance paradigms, where each affordance type or explicit instruction strictly corresponds to a specific affordance region, and they are unable to handle long-horizon tasks. Such a paradigm cannot actively reason about complex user intentions that often imply sequential affordances. In this paper, we introduce the Sequential 3D Affordance Reasoning task, which extends the traditional paradigm by reasoning from cumbersome user intentions and then decomposing them into a series of segmentation maps. Toward this, we construct the first instruction-based affordance segmentation benchmark that includes reasoning over both single and sequential affordances, comprising 180K instruction-point cloud pairs. Based on the…
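To make the task's output concrete, here is a minimal sketch of what "a series of segmentation maps" over a point cloud could look like as a data structure. It is not the authors' implementation: the `AffordanceStep` class, the `binarize_sequence` helper, the affordance labels, the scores, and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass
class AffordanceStep:
    """One step of a sequential prediction (hypothetical structure)."""
    label: str           # illustrative affordance name, e.g. "grasp"
    scores: List[float]  # one score in [0, 1] per point in the cloud


def binarize_sequence(
    points: List[Point],
    steps: List[AffordanceStep],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[Tuple[str, List[bool]]]:
    """Turn per-point scores into an ordered list of binary segmentation maps."""
    maps = []
    for step in steps:
        if len(step.scores) != len(points):
            raise ValueError(
                f"step '{step.label}' has {len(step.scores)} scores "
                f"for {len(points)} points"
            )
        maps.append((step.label, [s >= threshold for s in step.scores]))
    return maps


# Toy four-point cloud with two ordered steps; all values are made up.
points = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
steps = [
    AffordanceStep("grasp", [0.9, 0.8, 0.1, 0.2]),
    AffordanceStep("open", [0.1, 0.3, 0.7, 0.95]),
]
sequence = binarize_sequence(points, steps)
```

The order of the list carries the sequencing: each entry is one segmentation map over the same points, so a downstream consumer can act on the maps one after another.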
Taxonomy
Topics: Natural Language Processing Techniques · Human Motion and Animation · Handwritten Text Recognition Techniques
