AMIGO: Agentic Multi-Image Grounding Oracle Benchmark

Min Wang; Ata Mahjoubfar

arXiv:2603.28662 · cs.LG · March 31, 2026

TL;DR

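AMIGO is a benchmark that evaluates vision-language models on multi-image, multi-turn hidden-target identification: the model asks a sequence of Yes/No/Unsure questions to find a target image in a gallery of visually similar images, and the oracle can be made imperfect to test robustness.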

Contribution

It introduces a long-horizon, multi-image grounding benchmark with a protocol for question asking, evidence verification, and robustness analysis.

Findings

01

Models achieve measurable success in identifying target images.

02

AMIGO reveals strengths and weaknesses in question selection and robustness.

03

The benchmark provides comprehensive metrics for interaction quality and noise tolerance.

Abstract

Agentic vision-language models increasingly act through extended interactions, but most evaluations still focus on single-image, single-turn correctness. We introduce AMIGO (Agentic Multi-Image Grounding Oracle Benchmark), a long-horizon benchmark for hidden-target identification over galleries of visually similar images. In AMIGO, the oracle privately selects a target image, and the model must recover it by asking a sequence of attribute-focused Yes/No/Unsure questions under a strict protocol that penalizes invalid actions with Skip. This setting stresses (i) question selection under uncertainty, (ii) consistent constraint tracking across turns, and (iii) fine-grained discrimination as evidence accumulates. AMIGO also supports controlled oracle imperfections to probe robustness and verification behavior under inconsistent feedback. We instantiate AMIGO with Guess My Preferred Dress…
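
The interaction described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the benchmark's implementation: the names `Oracle`, `pick_question`, `play`, and `flip_prob` are hypothetical, images are stood in for by hand-written attribute sets, the questioner uses a simple greedy "split the candidates evenly" heuristic, and oracle imperfection is modeled as a random Yes/No flip. None of these details come from the paper.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Oracle:
    """Toy oracle holding a privately selected target (hypothetical schema)."""
    gallery: dict            # image id -> set of attribute strings (assumed)
    target: str              # hidden target image id
    known_attributes: set    # attributes the oracle is able to judge
    flip_prob: float = 0.0   # assumed noise model: chance of flipping Yes/No
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def answer(self, attribute):
        # Invalid actions (here: empty or non-string questions) yield Skip.
        if not isinstance(attribute, str) or not attribute.strip():
            return "Skip"
        # Attributes the oracle cannot judge yield Unsure.
        if attribute not in self.known_attributes:
            return "Unsure"
        truth = attribute in self.gallery[self.target]
        if self.rng.random() < self.flip_prob:
            truth = not truth  # controlled imperfection
        return "Yes" if truth else "No"


def pick_question(candidates, gallery, asked):
    """Pick the unasked attribute that splits the candidates most evenly."""
    attributes = sorted(set().union(*(gallery[c] for c in candidates)) - asked)
    if not attributes:
        return None
    half = len(candidates) / 2
    return min(
        attributes,
        key=lambda a: abs(sum(a in gallery[c] for c in candidates) - half),
    )


def play(oracle, max_turns=10):
    """Ask questions, track constraints across turns, then guess."""
    candidates = set(oracle.gallery)
    asked, transcript = set(), []
    for _ in range(max_turns):
        if len(candidates) <= 1:
            break
        attribute = pick_question(candidates, oracle.gallery, asked)
        if attribute is None:
            break
        asked.add(attribute)
        reply = oracle.answer(attribute)
        transcript.append((attribute, reply))
        if reply == "Yes":
            candidates = {c for c in candidates if attribute in oracle.gallery[c]}
        elif reply == "No":
            candidates = {c for c in candidates if attribute not in oracle.gallery[c]}
        # Unsure and Skip add no constraint, so the candidate set is unchanged.
    guess = min(candidates) if candidates else None
    return guess, transcript


# Hypothetical four-image gallery with hand-written attributes.
gallery = {
    "dress_a": {"red", "long sleeves", "floral"},
    "dress_b": {"red", "sleeveless", "floral"},
    "dress_c": {"blue", "long sleeves", "plain"},
    "dress_d": {"red", "long sleeves", "plain"},
}
oracle = Oracle(
    gallery=gallery,
    target="dress_d",
    known_attributes=set().union(*gallery.values()),
)
guess, transcript = play(oracle)
print(guess, transcript)
```

With a noise-free oracle the candidate set can only shrink toward the target. Setting `flip_prob` above zero can eliminate the true target, which is the kind of failure a verification strategy (for example, re-asking a question) would need to catch.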
