Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Łukasz Borchmann; Jordy Van Landeghem; Michał Turski; Shreyansh Padarha; Ryan Othniel Kearns; Adam Mahdi; Niels Rogge; Clémentine Fourrier; Siwei Han; Huaxiu Yao; Artemis Llabrés; Yiming Xu; Dimosthenis Karatzas; Hao Zhang; Anupam Datta

arXiv:2603.12180·cs.CL·March 23, 2026

TL;DR

This paper introduces MADQA, a benchmark to evaluate whether multimodal agents demonstrate genuine strategic reasoning or rely on stochastic search, revealing current limitations in agentic planning and efficiency.

Contribution

The paper presents MADQA, a new benchmark with an evaluation protocol to distinguish strategic reasoning from brute-force search in document question-answering agents.

Findings

01 The best agents match human searchers in raw accuracy but succeed on largely different questions.

02 Agents rely on brute-force search to compensate for weak strategic planning.

03 A gap of nearly 20% to oracle performance remains, with agents persisting in unproductive loops.

Abstract

Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive…
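The abstract mentions Classical Test Theory and discriminative power without giving the procedure. As a rough illustration only, the sketch below computes two standard Classical Test Theory item statistics: difficulty (the proportion of correct answers) and discrimination (the item-rest point-biserial correlation). The function name `item_stats`, the response-matrix layout, and the toy data are assumptions made for this example; they are not taken from the paper and may differ from how MADQA was actually constructed.

```python
from statistics import mean, pstdev


def item_stats(responses):
    """Return (difficulty, discrimination) per question.

    responses[a][q] is 1 if test-taker a answered question q correctly,
    else 0. This layout is a hypothetical one chosen for the sketch.
    """
    n_items = len(responses[0])
    stats = []
    for q in range(n_items):
        item = [row[q] for row in responses]
        # Rest score: total score excluding the item itself, so the item
        # is not correlated with its own contribution.
        rest = [sum(row) - row[q] for row in responses]
        difficulty = mean(item)  # proportion correct
        sd_item, sd_rest = pstdev(item), pstdev(rest)
        if sd_item == 0 or sd_rest == 0:
            # Everyone got the same result: the item cannot discriminate.
            discrimination = 0.0
        else:
            products = [i * r for i, r in zip(item, rest)]
            cov = mean(products) - mean(item) * mean(rest)
            discrimination = cov / (sd_item * sd_rest)
        stats.append((difficulty, discrimination))
    return stats


# Toy data: 4 hypothetical test-takers, 3 questions.
toy = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
]
for q, (p, r) in enumerate(item_stats(toy)):
    print(f"question {q}: difficulty={p:.2f} discrimination={r:.2f}")
```

Under this reading, a question that every test-taker answers correctly has zero discrimination, while a question answered correctly mainly by stronger test-takers has positive discrimination, which is the property a benchmark designer would want to favor.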

Code & Models

Datasets

OxRML/MADQA · dataset · 802 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems