Visual Agentic AI for Spatial Reasoning with a Dynamic API

Damiano Marsili; Rohun Agrawal; Yisong Yue; Georgia Gkioxari

arXiv:2502.06787 · cs.CV · March 31, 2025

TL;DR

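This paper presents an agentic program synthesis approach in which LLM agents generate a dynamic API to improve 3D spatial reasoning in visual understanding tasks. It outperforms prior zero-shot models and handles a wider range of queries than approaches built on a static, human-defined API.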

Contribution

The paper introduces an agentic program synthesis framework in which LLM agents collaboratively build a dynamic API for 3D spatial reasoning, overcoming limitations of prior approaches that rely on a static, human-defined API.

Findings

01 Outperforms prior zero-shot models in 3D visual reasoning

02 Introduces a new benchmark for multi-step grounding and inference

03 Empirically validates the effectiveness of the agentic framework

Abstract

Visual reasoning -- the ability to interpret the visual world -- is crucial for embodied agents that operate within three-dimensional scenes. Progress in AI has led to vision and language models capable of answering questions from images. However, their performance declines when tasked with 3D spatial reasoning. To tackle the complexity of such reasoning problems, we introduce an agentic program synthesis approach where LLM agents collaboratively generate a Pythonic API with new functions to solve common subproblems. Our method overcomes limitations of prior approaches that rely on a static, human-defined API, allowing it to handle a wider range of queries. To assess AI capabilities for 3D understanding, we introduce a new benchmark of queries involving multiple steps of grounding and inference. We show that our method outperforms prior zero-shot models for visual reasoning in 3D and…
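
The idea of an API that grows with generated helper functions can be illustrated with a minimal sketch. This is not the authors' implementation: the names `DynamicAPI`, `locate`, `SCENE`, and `distance_between` are hypothetical, the scene is a hand-written stand-in for a vision model's output, and the two source strings are hard-coded placeholders for what LLM agents would generate.

```python
# Minimal sketch of a "dynamic API": a namespace of primitives that can be
# extended at run time with new helper functions supplied as source code.
# All names and data here are hypothetical and for illustration only.

# Hand-written stand-in for 3D object detections (centers in metres).
SCENE = {
    "chair": {"center": (0.0, 0.0, 2.0), "height": 0.9},
    "table": {"center": (1.0, 0.0, 3.0), "height": 0.75},
}


def locate(name):
    """Primitive: return the stored 3D record for a named object."""
    return SCENE[name]


class DynamicAPI:
    """Holds callable primitives and any helpers added later."""

    def __init__(self, primitives):
        self.namespace = dict(primitives)

    def add_function(self, source):
        # Run the source so that its function definitions land in the
        # shared namespace and can call existing primitives and helpers.
        exec(source, self.namespace)

    def run_program(self, program):
        # Execute a program against a copy of the API namespace and
        # return whatever it assigns to the variable `answer`.
        scope = dict(self.namespace)
        exec(program, scope)
        return scope.get("answer")


# Placeholder for a helper that an API-writing agent might propose.
GENERATED_HELPER = '''
def distance_between(a, b):
    """Euclidean distance between the 3D centers of two named objects."""
    pa, pb = locate(a)["center"], locate(b)["center"]
    return sum((x - y) ** 2 for x, y in zip(pa, pb)) ** 0.5
'''

# Placeholder for a program that a second agent might write for the
# query "How far is the chair from the table?".
GENERATED_PROGRAM = '''
answer = distance_between("chair", "table")
'''

api = DynamicAPI({"locate": locate})
api.add_function(GENERATED_HELPER)
result = api.run_program(GENERATED_PROGRAM)
print(result)
```

In this sketch the helper is defined once and then reused by any later program, which is the contrast with a static, human-defined API. Executing generated source with `exec` is only acceptable here because the strings are fixed; a real system would need sandboxing and validation.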

Code & Models

Datasets

dmarsili/Omni3D-Bench
dataset · 359 dl

Taxonomy

Topics: Constraint Satisfaction and Optimization · Semantic Web and Ontologies · Geographic Information Systems Studies