SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

Chan Yeong Hwang; Miso Choi; Sunghyun On; Jinkyu Kim; Jungbeom Lee

arXiv:2604.21190·cs.CV·April 29, 2026


TL;DR

SpatiO is a multi-agent framework that uses adaptive test-time orchestration to improve spatial reasoning in vision-language tasks, dynamically drawing on diverse inductive biases without retraining.

Contribution

The paper presents SpatiO, a heterogeneous multi-agent system with a novel test-time orchestration mechanism for adaptive spatial reasoning.

Findings

01

SpatiO outperforms baseline models on multiple spatial reasoning benchmarks.

02

Dynamic agent reweighting improves reasoning accuracy across diverse contexts.

03

Heterogeneous agents effectively leverage complementary inductive biases.
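
To make the idea of dynamic agent reweighting concrete, here is a minimal sketch of one generic way it could work: a weighted vote over agent answers, followed by a multiplicative update that shrinks the weight of agents that disagreed with the consensus. This is not taken from the paper. The agent names (`appearance_2d`, `depth`, `geometry`), the functions `weighted_vote` and `reweight`, the penalty `eta`, and the agreement-based update rule are all illustrative assumptions, and SpatiO's actual orchestration mechanism may differ.

```python
from collections import defaultdict


def weighted_vote(answers, weights):
    """Return the answer whose supporting agents carry the most total weight."""
    totals = defaultdict(float)
    for agent, answer in answers.items():
        totals[answer] += weights[agent]
    return max(totals, key=totals.get)


def reweight(answers, weights, consensus, eta=0.5):
    """Hypothetical update: scale dissenting agents by (1 - eta), then renormalize."""
    updated = {
        agent: w * (1.0 if answers[agent] == consensus else 1.0 - eta)
        for agent, w in weights.items()
    }
    total = sum(updated.values())
    return {agent: w / total for agent, w in updated.items()}


# Three hypothetical agents, each standing in for a different inductive bias.
weights = {"appearance_2d": 1 / 3, "depth": 1 / 3, "geometry": 1 / 3}
answers = {"appearance_2d": "left", "depth": "right", "geometry": "right"}

consensus = weighted_vote(answers, weights)
weights = reweight(answers, weights, consensus)
```

In this toy run the depth and geometry agents agree, so their shared answer becomes the consensus and the dissenting 2D appearance agent loses weight for subsequent queries. Using agreement with the consensus as the update signal is only one label-free choice; other confidence or consistency signals would fit the same structure.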

Abstract

Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision-language tasks, spatial reasoning requires integrating multiple inductive biases, such as 2D appearance cues, depth signals, and geometric constraints, whose reliability varies across contexts. This suggests that effective spatial reasoning requires spatial adaptability: the ability to flexibly coordinate different reasoning strategies depending on the input. However, most existing approaches rely on a single reasoning pipeline that implicitly learns a fixed spatial prior, limiting their ability to adapt under distribution changes. Multi-agent systems offer a promising alternative by aggregating diverse reasoning trajectories, but prior attempts in spatial reasoning primarily employ homogeneous agents, restricting the diversity of inductive…
