TL;DR
This paper introduces a data-efficient LLM agent system capable of complex spatial reasoning in warehouse environments, outperforming existing models in accuracy and efficiency for spatial question answering tasks.
Contribution
The paper presents a novel LLM agent framework with integrated tools for advanced spatial reasoning in warehouse scenarios, reducing reliance on extensive fine-tuning.
Findings
High accuracy in object retrieval, counting, and distance estimation
Effective spatial reasoning in complex indoor warehouse scenarios
Outperforms existing methods on the 2025 AI City Challenge Physical AI Spatial Intelligence Warehouse dataset
Abstract
Spatial understanding remains a challenging task for existing Multi-modal Large Language Models (MLLMs). Previous methods rely on large-scale MLLM fine-tuning to improve spatial understanding. In this paper, we present a data-efficient approach: an LLM agent system with strong spatial reasoning ability that can solve challenging spatial question answering tasks in complex indoor warehouse scenarios. Our system integrates multiple tools that allow the LLM agent to conduct spatial reasoning and interact with API tools to answer complicated spatial questions. Extensive evaluations on the 2025 AI City Challenge Physical AI Spatial Intelligence Warehouse dataset demonstrate that our system achieves high accuracy and efficiency in tasks such as object retrieval, counting, and distance estimation. The code is available at:…
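The abstract does not spell out the tool interfaces, so the following is only a minimal sketch of how an agent might dispatch spatial questions to small deterministic tools instead of relying on a fine-tuned model. Everything here is an assumption made for illustration, not the authors' implementation: the `SceneObject` record, the tool names `distance`, `count`, and `nearest`, the `call_tool` dispatcher, and the example coordinates are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical record for one detected object in a warehouse scene."""
    name: str          # illustrative identifier, e.g. "pallet_a"
    category: str      # illustrative class label, e.g. "pallet"
    position: tuple    # assumed (x, y, z) centroid in meters


def distance(a: SceneObject, b: SceneObject) -> float:
    """Euclidean distance between two object centroids."""
    return math.dist(a.position, b.position)


def count(objects: list, category: str) -> int:
    """Number of objects in the scene with the given category."""
    return sum(1 for o in objects if o.category == category)


def nearest(objects: list, target: SceneObject, category: str) -> SceneObject:
    """Object of the given category closest to the target (excluding the target)."""
    candidates = [o for o in objects if o.category == category and o is not target]
    return min(candidates, key=lambda o: distance(o, target))


# Hypothetical tool registry: the agent would pick a tool name and arguments,
# and the surrounding system would execute the call and return the result.
TOOLS = {"distance": distance, "count": count, "nearest": nearest}


def call_tool(name: str, *args):
    """Look up a tool by name and run it with the supplied arguments."""
    return TOOLS[name](*args)


# Made-up example scene, used only to exercise the tools.
pallet_a = SceneObject("pallet_a", "pallet", (0.0, 0.0, 0.0))
pallet_b = SceneObject("pallet_b", "pallet", (3.0, 4.0, 0.0))
forklift = SceneObject("forklift_1", "forklift", (0.0, 1.0, 0.0))
scene = [pallet_a, pallet_b, forklift]

pallet_gap = call_tool("distance", pallet_a, pallet_b)
pallet_total = call_tool("count", scene, "pallet")
closest_pallet = call_tool("nearest", scene, forklift, "pallet")
```

In a design of this kind the language model only has to choose which tool to call and with which objects, while the geometry itself is computed exactly, which is one plausible reason a tool-based agent can need less training data.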
