AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models

Xinyi Wang; Xun Yang; Yanlong Xu; Yuchen Wu; Zhen Li; Na Zhao

arXiv:2511.10017 · cs.CV · November 14, 2025

TL;DR

This paper introduces AffordBot, a framework that combines multimodal large language models with a chain-of-thought reasoning process to enable fine-grained, instruction-driven understanding of 3D scenes for embodied agents.

Contribution

The paper proposes a new task, fine-grained 3D embodied reasoning, and develops AffordBot, which integrates MLLMs with scene rendering and active perception to improve affordance understanding.

Findings

01 Achieves state-of-the-art results on the SceneFun3D dataset

02 Demonstrates strong generalization with only 3D point cloud input

03 Shows effective physically grounded reasoning in 3D scenes

Abstract

Effective human-agent collaboration in physical environments requires understanding not only what to act upon, but also where the actionable elements are and how to interact with them. Existing approaches often operate at the object level or disjointedly handle fine-grained affordance reasoning, lacking coherent, instruction-driven grounding and reasoning. In this work, we introduce a new task: Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, a structured triplet comprising its spatial location, motion type, and motion axis, based on a task instruction. To solve this task, we propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm. To bridge the gap between 3D input and 2D-compatible MLLMs, we render surround-view…
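
As a rough illustration of the structured triplet described in the abstract (spatial location, motion type, motion axis), here is a minimal Python sketch. It is not the authors' implementation: the class name `AffordanceTriplet`, its field names, the use of point indices to encode location, and the string label "rotation" are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple
import math

Vec3 = Tuple[float, float, float]


@dataclass
class AffordanceTriplet:
    """Hypothetical container for one predicted affordance element."""
    # Assumption: location is stored as indices into the scene point cloud.
    point_indices: List[int]
    # Assumption: motion type is a plain string label, e.g. "rotation".
    motion_type: str
    # Direction of the motion axis in scene coordinates (not necessarily unit length).
    motion_axis: Vec3

    def normalized_axis(self) -> Vec3:
        """Return the motion axis scaled to unit length."""
        norm = math.sqrt(sum(c * c for c in self.motion_axis))
        if norm == 0.0:
            raise ValueError("motion axis must be non-zero")
        x, y, z = self.motion_axis
        return (x / norm, y / norm, z / norm)


def centroid(points: Sequence[Vec3], indices: Sequence[int]) -> Vec3:
    """Mean position of the selected points, a coarse summary of the location."""
    if not indices:
        raise ValueError("indices must be non-empty")
    n = len(indices)
    sx = sum(points[i][0] for i in indices)
    sy = sum(points[i][1] for i in indices)
    sz = sum(points[i][2] for i in indices)
    return (sx / n, sy / n, sz / n)


# Toy usage with made-up data: three scene points, two of which form the element.
scene = [(0.0, 0.0, 0.0), (2.0, 2.0, 2.0), (9.0, 9.0, 9.0)]
handle = AffordanceTriplet(point_indices=[0, 1], motion_type="rotation",
                           motion_axis=(0.0, 0.0, 2.0))
print(handle.normalized_axis(), centroid(scene, handle.point_indices))
```

Keeping the three components in one record mirrors the abstract's framing that they are predicted jointly for each referenced element; how the actual system encodes each component is not specified in this excerpt.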

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Human Motion and Animation