SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning
Chan Yeong Hwang, Miso Choi, Sunghyun On, Jinkyu Kim, Jungbeom Lee

TL;DR
SpatiO introduces a multi-agent framework with adaptive test-time orchestration for improved spatial reasoning in vision-language tasks, dynamically leveraging diverse inductive biases without retraining.
Contribution
The paper presents SpatiO, a heterogeneous multi-agent system with a novel test-time orchestration mechanism for adaptive spatial reasoning.
Findings
SpatiO outperforms baseline models on multiple spatial reasoning benchmarks.
Dynamic agent reweighting improves reasoning accuracy across diverse contexts.
Heterogeneous agents effectively leverage complementary inductive biases.
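To make the idea of dynamic agent reweighting concrete, here is a minimal sketch of one way such a scheme could work. It is not the paper's orchestration mechanism, which this summary does not specify. The agent names, the `weighted_vote` and `reweight` helpers, the use of agreement with the consensus as an unlabeled reliability signal, and the multiplicative update with rate `eta` are all assumptions made for illustration.

```python
from collections import defaultdict


def weighted_vote(answers, weights):
    """Return the answer that collects the largest total agent weight."""
    scores = defaultdict(float)
    for agent, answer in answers.items():
        scores[answer] += weights[agent]
    return max(scores, key=scores.get)


def reweight(answers, weights, consensus, eta=0.5):
    """Shrink the weight of agents that disagreed with the consensus, then renormalize.

    Hypothetical update rule: a multiplicative-weights step that uses
    agreement with the consensus as a label-free reliability proxy.
    """
    updated = {
        agent: w * (1.0 if answers[agent] == consensus else 1.0 - eta)
        for agent, w in weights.items()
    }
    total = sum(updated.values())
    return {agent: w / total for agent, w in updated.items()}


# Three hypothetical agents, each standing in for a different inductive bias.
weights = {"appearance_2d": 1 / 3, "depth": 1 / 3, "geometry": 1 / 3}
answers = {"appearance_2d": "left", "depth": "left", "geometry": "right"}

consensus = weighted_vote(answers, weights)
weights = reweight(answers, weights, consensus)
```

In this toy round the two agreeing agents keep most of the weight and the dissenting agent is down-weighted for later queries. No retraining is involved, since only the aggregation weights change.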
Abstract
Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision-language tasks, spatial reasoning requires integrating multiple inductive biases, such as 2D appearance cues, depth signals, and geometric constraints, whose reliability varies across contexts. This suggests that effective spatial reasoning requires spatial adaptability: the ability to flexibly coordinate different reasoning strategies depending on the input. However, most existing approaches rely on a single reasoning pipeline that implicitly learns a fixed spatial prior, limiting their ability to adapt under distribution changes. Multi-agent systems offer a promising alternative by aggregating diverse reasoning trajectories, but prior attempts in spatial reasoning primarily employ homogeneous agents, restricting the diversity of inductive…
