MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

Henry Zheng; Chenyue Fang; Rui Huang; Siyuan Wei; Xiao Liu; Gao Huang

arXiv:2604.09167·cs.CV·April 13, 2026

TL;DR

MAG-3D is a training-free multi-agent framework that uses off-the-shelf vision-language models for flexible, zero-shot grounded reasoning in complex 3D scenes, and is reported to surpass prior methods.

Contribution

The paper presents MAG-3D, a multi-agent system that performs flexible grounded 3D reasoning with existing vision-language models, requiring neither training nor task-specific tuning.

Findings

01. Achieves state-of-the-art results on 3D reasoning benchmarks.

02. Demonstrates effective zero-shot generalization across diverse scenes.

03. Enables flexible, training-free reasoning without hand-crafted pipelines.

Abstract

Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in-domain tuning or hand-crafted reasoning pipelines, which limit their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the…
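
The two-step pattern the abstract describes (first ground the query-relevant objects, then reason about their spatial relationships) can be illustrated with a toy sketch. This is not MAG-3D's implementation: the `SceneObject` type, the `ground` and `nearest` helpers, the label-matching grounding rule, and the example scene are all hypothetical, and a real system would use VLM agents rather than string matching.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple
import math


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical scene entry: a label plus a 3D centroid in metres."""
    label: str
    center: Tuple[float, float, float]


def ground(scene: Sequence[SceneObject], query_labels: Sequence[str]) -> List[SceneObject]:
    """Step 1 (grounding): keep only objects whose label appears in the query.

    Assumption: plain label matching stands in for a real grounding model.
    """
    wanted = {label.lower() for label in query_labels}
    return [obj for obj in scene if obj.label.lower() in wanted]


def nearest(anchor: SceneObject, candidates: Sequence[SceneObject]) -> Tuple[SceneObject, float]:
    """Step 2 (reasoning): return the candidate closest to the anchor and its distance."""
    best = min(candidates, key=lambda obj: math.dist(anchor.center, obj.center))
    return best, math.dist(anchor.center, best.center)


# Hypothetical scene; coordinates are made up for illustration.
scene = [
    SceneObject("sofa", (0.0, 0.0, 0.0)),
    SceneObject("lamp", (1.0, 0.0, 0.0)),
    SceneObject("lamp", (4.0, 3.0, 0.0)),
    SceneObject("table", (2.0, 2.0, 0.0)),
]

# Query: "Which lamp is closest to the sofa?"
grounded = ground(scene, ["sofa", "lamp"])
sofa = next(obj for obj in grounded if obj.label == "sofa")
lamps = [obj for obj in grounded if obj.label == "lamp"]
closest_lamp, distance = nearest(sofa, lamps)
print(closest_lamp.center, distance)
```

Separating grounding from the geometric comparison keeps the second step independent of how the objects were found, which is the property that lets a grounding component be swapped out without touching the reasoning step.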
