AMIGO: Agentic Multi-Image Grounding Oracle Benchmark
Min Wang, Ata Mahjoubfar

TL;DR
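AMIGO is a benchmark for evaluating vision-language models in multi-image, multi-turn settings: the model must identify a hidden target image in a gallery of visually similar images by asking Yes/No/Unsure questions, with optional oracle imperfections to test robustness.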
Contribution
It introduces a long-horizon, multi-image grounding benchmark with a protocol for question asking, evidence verification, and robustness analysis.
Findings
Models achieve measurable success in identifying target images.
AMIGO reveals strengths and weaknesses in question selection and robustness.
The benchmark provides metrics for interaction quality and noise tolerance.
Abstract
Agentic vision-language models increasingly act through extended interactions, but most evaluations still focus on single-image, single-turn correctness. We introduce AMIGO (Agentic Multi-Image Grounding Oracle Benchmark), a long-horizon benchmark for hidden-target identification over galleries of visually similar images. In AMIGO, the oracle privately selects a target image, and the model must recover it by asking a sequence of attribute-focused Yes/No/Unsure questions under a strict protocol that penalizes invalid actions with Skip. This setting stresses (i) question selection under uncertainty, (ii) consistent constraint tracking across turns, and (iii) fine-grained discrimination as evidence accumulates. AMIGO also supports controlled oracle imperfections to probe robustness and verification behavior under inconsistent feedback. We instantiate AMIGO with Guess My Preferred Dress…
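
The interaction loop described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the benchmark's implementation: images are replaced by hand-written boolean attribute dictionaries, and the names `Oracle`, `pick_question`, `play`, `flip_prob`, and the dress attributes are all hypothetical. The oracle answers Yes/No/Unsure, returns Skip for questions outside the protocol, and can flip answers with a fixed probability to mimic controlled oracle imperfections; the questioner greedily picks the attribute that splits the remaining candidates most evenly.

```python
import random
from enum import Enum


class Answer(Enum):
    YES = "Yes"
    NO = "No"
    UNSURE = "Unsure"
    SKIP = "Skip"  # returned for actions that violate the protocol


class Oracle:
    """Hypothetical oracle holding a private target from the gallery."""

    def __init__(self, gallery, target_id, flip_prob=0.0, seed=0):
        self._target = gallery[target_id]
        # Assumption: a question is valid only if it names a known attribute.
        self.known = {attr for item in gallery.values() for attr in item}
        self.flip_prob = flip_prob  # assumed noise model: random answer flips
        self.rng = random.Random(seed)

    def ask(self, attribute):
        if not isinstance(attribute, str) or attribute not in self.known:
            return Answer.SKIP
        value = self._target.get(attribute)
        if value is None:  # attribute not annotated for the target
            return Answer.UNSURE
        if self.rng.random() < self.flip_prob:
            value = not value
        return Answer.YES if value else Answer.NO


def pick_question(gallery, candidates, asked):
    """Greedy choice: the unasked attribute with the most even Yes/No split."""
    best, best_score = None, 0
    attrs = {a for i in candidates for a in gallery[i]} - asked
    for attr in sorted(attrs):
        yes = sum(1 for i in candidates if gallery[i].get(attr) is True)
        no = sum(1 for i in candidates if gallery[i].get(attr) is False)
        if min(yes, no) > best_score:
            best, best_score = attr, min(yes, no)
    return best  # None when no attribute separates the candidates


def play(gallery, oracle, max_turns=10):
    """Ask questions, track constraints across turns, then guess."""
    candidates, asked, transcript = set(gallery), set(), []
    for _ in range(max_turns):
        if len(candidates) <= 1:
            break
        attr = pick_question(gallery, candidates, asked)
        if attr is None:
            break
        asked.add(attr)
        answer = oracle.ask(attr)
        transcript.append((attr, answer))
        if answer is Answer.YES:
            candidates = {i for i in candidates if gallery[i].get(attr) is not False}
        elif answer is Answer.NO:
            candidates = {i for i in candidates if gallery[i].get(attr) is not True}
        # Unsure and Skip carry no evidence, so the candidates stay unchanged.
    guess = min(candidates) if candidates else None
    return guess, transcript


# Toy gallery with invented attributes, for illustration only.
gallery = {
    "dress_a": {"sleeveless": True, "floral": True, "long": False},
    "dress_b": {"sleeveless": True, "floral": False, "long": False},
    "dress_c": {"sleeveless": False, "floral": True, "long": True},
    "dress_d": {"sleeveless": False, "floral": False, "long": True},
}

oracle = Oracle(gallery, target_id="dress_c")
guess, transcript = play(gallery, oracle)
```

With a noiseless oracle this questioner never eliminates the true target. Setting `flip_prob` above zero can leave the candidate set empty, which is the kind of inconsistency a model would need to detect and verify against rather than trust blindly.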
