ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search
Myungchul Kim, Kwanyong Park, Junmo Kim, and In So Kweon

TL;DR
ARGOS is a benchmark and framework that models multi-camera person search as an interactive reasoning task: the agent must plan, ask questions, and use tools under information asymmetry and a limited turn budget.
Contribution
It introduces the first benchmark for agentic multi-camera person search, integrating reasoning, tool use, and real-world scenarios with comprehensive evaluation.
Findings
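The benchmark includes 2,691 tasks across 14 real-world scenarios, organized in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When).
The four LLM backbones tested remain far from solving the benchmark: the best TWS is 0.383 on Track 2 and 0.590 on Track 3.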
Removing domain-specific tools significantly reduces accuracy.
Abstract
We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that…
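To make the role of the STTG more concrete, the sketch below shows one way a graph of camera connectivity and transition times could be used to eliminate candidates. It is an illustration only, not the paper's implementation: the class name `TopologyGraph`, the `(min, max)` second-based transition windows, the camera names, and the candidate records are all hypothetical assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class TopologyGraph:
    """Hypothetical stand-in for an STTG: directed camera-to-camera edges,
    each with an assumed (min_seconds, max_seconds) transition window."""
    edges: dict = field(default_factory=dict)

    def add_transition(self, src, dst, min_s, max_s):
        self.edges[(src, dst)] = (min_s, max_s)

    def is_feasible(self, src, t_src, dst, t_dst):
        """True if moving from src at t_src to dst at t_dst fits the window."""
        window = self.edges.get((src, dst))
        if window is None:
            return False  # cameras are not connected in the graph
        elapsed = t_dst - t_src
        return window[0] <= elapsed <= window[1]


def eliminate(graph, last_seen, candidates):
    """Keep only candidates whose sighting is reachable from the last
    confirmed sighting, given as (camera, time_in_seconds)."""
    cam, t = last_seen
    return [c for c in candidates
            if graph.is_feasible(cam, t, c["camera"], c["time"])]


# Illustrative data (all values are made up for this sketch).
graph = TopologyGraph()
graph.add_transition("cam_A", "cam_B", 20, 60)
graph.add_transition("cam_A", "cam_C", 90, 180)

candidates = [
    {"id": 1, "camera": "cam_B", "time": 130},  # 30 s after cam_A: feasible
    {"id": 2, "camera": "cam_C", "time": 130},  # 30 s is too fast for cam_C
    {"id": 3, "camera": "cam_D", "time": 150},  # no edge from cam_A
]

survivors = [c["id"] for c in eliminate(graph, ("cam_A", 100), candidates)]
print(survivors)  # [1]
```

In this toy setup, a spatial tool call would correspond to checking whether an edge exists, and a temporal tool call to checking the elapsed time against the edge's window. How ARGOS actually exposes these tools to the agent is not specified in the abstract.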
