DETOUR: An Interactive Benchmark for Dual-Agent Search and Reasoning
Li Siyan, Darshan Deshpande, Anand Kannappan, Rebecca Qian

TL;DR
DETOUR is a new benchmark for evaluating dual-agent search and reasoning in complex, multi-turn, multi-modal recall tasks, revealing current models' limitations in underspecified scenarios.
Contribution
Introduces DETOUR, a dual-agent benchmark with 1,011 prompts for more realistic tip-of-the-tongue search evaluation, emphasizing multi-turn and multi-modal challenges.
Findings
State-of-the-art models achieve only 36% accuracy on DETOUR when evaluated on all modalities (text, image, audio, and video).
Current models struggle with underspecified, multi-modal recall tasks.
The results highlight the need for improved reasoning and retrieval capabilities.
Abstract
When recalling information in conversation, people often arrive at the recollection after multiple turns. However, existing benchmarks for evaluating agent capabilities in such tip-of-the-tongue search processes are restricted to single-turn settings. To more realistically simulate tip-of-the-tongue search, we introduce Dual-agent based Evaluation Through Obscure Under-specified Retrieval (DETOUR), a dual-agent evaluation benchmark containing 1,011 prompts. The benchmark design involves a Primary Agent, which is the subject of evaluation, tasked with identifying the recollected entity through querying a Memory Agent that is held consistent across evaluations. Our results indicate that current state-of-the-art models still struggle with our benchmark, only achieving 36% accuracy when evaluated on all modalities (text, image, audio, and video), highlighting the importance of enhancing…
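The dual-agent setup described in the abstract can be pictured as a short interaction loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the names `run_episode`, `toy_primary`, `toy_memory`, and `accuracy`, the turn limit, the exact-match scoring, and the example entity are all hypothetical, and the real benchmark uses model-based agents and multi-modal prompts.

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]  # (question, answer) pairs so far

def run_episode(
    primary_step: Callable[[Transcript], Tuple[str, str]],
    memory_reply: Callable[[str], str],
    target: str,
    max_turns: int = 5,  # hypothetical turn budget, not from the paper
) -> Tuple[bool, Transcript]:
    """Let the Primary Agent query the Memory Agent until it guesses."""
    transcript: Transcript = []
    for _ in range(max_turns):
        kind, text = primary_step(transcript)
        if kind == "guess":
            # Assumed scoring: case-insensitive exact match on the entity name.
            correct = text.strip().lower() == target.strip().lower()
            return correct, transcript
        transcript.append((text, memory_reply(text)))
    return False, transcript  # ran out of turns without a guess

def toy_memory(question: str) -> str:
    """Stand-in Memory Agent holding a vague recollection (fixed across runs)."""
    facts = {
        "decade": "I think it came out in the 1990s.",
        "genre": "It was a sci-fi film.",
    }
    for key, answer in facts.items():
        if key in question.lower():
            return answer
    return "I don't remember."

def toy_primary(transcript: Transcript) -> Tuple[str, str]:
    """Stand-in Primary Agent: asks two fixed questions, then guesses."""
    questions = ["What genre was it?", "Which decade was it released in?"]
    if len(transcript) < len(questions):
        return ("ask", questions[len(transcript)])
    return ("guess", "The Matrix")

def accuracy(outcomes: List[bool]) -> float:
    """Fraction of episodes in which the final guess was correct."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

if __name__ == "__main__":
    ok, log = run_episode(toy_primary, toy_memory, target="the matrix")
    print(ok, len(log))
```

Keeping the Memory Agent fixed while swapping in different Primary Agents mirrors the abstract's point that only the Primary Agent is the subject of evaluation.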
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Social Robot Interaction and HRI
