MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding
Henry Zheng, Chenyue Fang, Rui Huang, Siyuan Wei, Xiao Liu, Gao Huang

TL;DR
MAG-3D introduces a training-free multi-agent framework leveraging off-the-shelf vision-language models for flexible, zero-shot grounded reasoning in complex 3D scenes, surpassing prior methods.
Contribution
The paper presents MAG-3D, a multi-agent system that performs flexible grounded reasoning in 3D scenes with existing vision-language models, requiring neither training nor task-specific tuning.
Findings
Achieves state-of-the-art results on 3D reasoning benchmarks.
Demonstrates effective zero-shot generalization across diverse scenes.
Enables flexible, training-free reasoning without hand-crafted pipelines.
Abstract
Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in-domain tuning or hand-crafted reasoning pipelines, which limit their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the…
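The ground-then-reason order described in the abstract (first identify the query-relevant objects, then reason about their spatial relationships) can be illustrated with a toy sketch. This is not MAG-3D's implementation: the `SceneObject` type, the label-matching `ground` function, the `distance_between` helper, and the example scene are all hypothetical stand-ins, and no VLM or agent coordination is involved.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneObject:
    """Hypothetical scene entry: a label plus a 3D centre point."""
    label: str
    center: Tuple[float, float, float]


def ground(scene: List[SceneObject], label: str) -> List[SceneObject]:
    """Grounding step (toy version): keep objects whose label matches the query term."""
    return [obj for obj in scene if obj.label == label]


def distance_between(scene: List[SceneObject], label_a: str, label_b: str) -> float:
    """Reasoning step (toy version): Euclidean distance between the first
    grounded object of each label. Raises ValueError if either label is absent."""
    objs_a = ground(scene, label_a)
    objs_b = ground(scene, label_b)
    if not objs_a or not objs_b:
        raise ValueError("query refers to an object that is not in the scene")
    return math.dist(objs_a[0].center, objs_b[0].center)


# Made-up scene used only for illustration.
scene = [
    SceneObject("chair", (0.0, 0.0, 0.0)),
    SceneObject("table", (3.0, 4.0, 0.0)),
    SceneObject("lamp", (1.0, 1.0, 1.0)),
]

print(distance_between(scene, "chair", "table"))  # 5.0
```

The point of the sketch is only the dependency between the two steps: the spatial answer is computed over whatever the grounding step returns, so a grounding error propagates directly into the reasoning result.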
