Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models
Sagar Bharadwaj, Ziyong Ma, Anurag Ghosh, Srinivasan Seshan, Anthony Rowe

TL;DR
Flame3D is a training-free framework that lets an off-the-shelf multimodal LLM reason about 3D scenes zero-shot and compositionally through external tools, with no 3D-specific training.
Contribution
The paper introduces Flame3D, an inference-time approach that builds editable 3D scene memories and synthesizes spatial programs for open-ended reasoning.
Findings
Competitive performance on ScanQA without training
Essential role of synthesized spatial operations in reasoning
Effective reasoning over layouts and objects not in the scene
Abstract
3D scene understanding spans reasoning about free space, object grounding, hypothetical object insertions, complex geometric relationships, and integrating all of these with external tools and data sources. Existing 3D understanding methods typically rely on large-scale 3D-language training or focus on object grounding and simple spatial relationships. We argue that the broad generalization that motivates 3D-language training can be achieved at inference time, without 3D-specific training. We propose Flame3D, a training-free framework that represents scenes as editable visual-textual 3D memories and exposes them to an off-the-shelf MLLM through composable spatial tools. Flame3D also lets the agent synthesize custom spatial programs at inference time, enabling open-ended reasoning over layouts, empty space, and objects not yet present in the scene. External data and corrections can be…
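To make the idea of an editable scene memory exposed through composable spatial tools more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `SceneObject`, `SceneMemory`, `distance`, `overlaps`, and `fits` are hypothetical, objects are simplified to axis-aligned boxes, and the language-model agent that would call these tools is left out entirely.

```python
from dataclasses import dataclass
import math


@dataclass
class SceneObject:
    """Hypothetical memory entry: a labelled axis-aligned box, in metres."""
    label: str
    center: tuple[float, float, float]
    size: tuple[float, float, float]

    def bounds(self):
        # Lower and upper corners of the box.
        lo = tuple(c - s / 2 for c, s in zip(self.center, self.size))
        hi = tuple(c + s / 2 for c, s in zip(self.center, self.size))
        return lo, hi


class SceneMemory:
    """Hypothetical editable memory: entries can be added or removed."""

    def __init__(self):
        self.objects = {}

    def add(self, name, obj):
        self.objects[name] = obj

    def remove(self, name):
        self.objects.pop(name, None)


def distance(memory, a, b):
    """Spatial tool (assumed): centre-to-centre distance between two entries."""
    return math.dist(memory.objects[a].center, memory.objects[b].center)


def overlaps(a, b):
    """Spatial tool (assumed): True if two boxes intersect on all three axes."""
    (alo, ahi), (blo, bhi) = a.bounds(), b.bounds()
    return all(alo[i] < bhi[i] and blo[i] < ahi[i] for i in range(3))


def fits(memory, candidate):
    """Composed tool: would a not-yet-present object fit without collisions?"""
    return not any(overlaps(candidate, o) for o in memory.objects.values())


memory = SceneMemory()
memory.add("sofa", SceneObject("sofa", (0.0, 0.0, 0.4), (2.0, 1.0, 0.8)))
memory.add("table", SceneObject("table", (3.0, 0.0, 0.4), (1.0, 1.0, 0.8)))

# Hypothetical insertion: a lamp placed in the gap between sofa and table.
lamp = SceneObject("lamp", (1.5, 0.0, 0.75), (0.4, 0.4, 1.5))
print(distance(memory, "sofa", "table"), fits(memory, lamp))
```

In this sketch, `fits` is built from `overlaps` in the same spirit as the synthesized spatial programs the abstract describes: a question about empty space or an object not yet in the scene reduces to a small composition of geometric primitives over the memory.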
