Visual Agentic AI for Spatial Reasoning with a Dynamic API
Damiano Marsili, Rohun Agrawal, Yisong Yue, Georgia Gkioxari

TL;DR
This paper presents an agentic AI approach that uses dynamic API generation via program synthesis to improve 3D spatial reasoning in visual understanding tasks, outperforming static methods.
Contribution
It introduces a novel agentic program synthesis framework with a dynamic API for enhanced 3D spatial reasoning, surpassing prior static API-based models.
Findings
Outperforms prior zero-shot models in 3D visual reasoning
Introduces a new benchmark for multi-step grounding and inference
Empirically validates the effectiveness of the agentic framework
Abstract
Visual reasoning -- the ability to interpret the visual world -- is crucial for embodied agents that operate within three-dimensional scenes. Progress in AI has led to vision and language models capable of answering questions from images. However, their performance declines when tasked with 3D spatial reasoning. To tackle the complexity of such reasoning problems, we introduce an agentic program synthesis approach where LLM agents collaboratively generate a Pythonic API with new functions to solve common subproblems. Our method overcomes limitations of prior approaches that rely on a static, human-defined API, allowing it to handle a wider range of queries. To assess AI capabilities for 3D understanding, we introduce a new benchmark of queries involving multiple steps of grounding and inference. We show that our method outperforms prior zero-shot models for visual reasoning in 3D and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsConstraint Satisfaction and Optimization · Semantic Web and Ontologies · Geographic Information Systems Studies
