Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta

TL;DR
This paper introduces MADQA, a benchmark to evaluate whether multimodal agents demonstrate genuine strategic reasoning or rely on stochastic search, revealing current limitations in agentic planning and efficiency.
Contribution
The paper presents MADQA, a new benchmark with an evaluation protocol to distinguish strategic reasoning from brute-force search in document question-answering agents.
Findings
The best agents match human searchers in raw accuracy but succeed on largely different questions.
Agents rely on brute-force search to compensate for weak strategic planning.
A gap of nearly 20% to oracle performance remains, with agents persisting in unproductive loops.
Abstract
Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive…
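The abstract says the benchmark is guided by Classical Test Theory to maximize discriminative power, but this page does not say which statistics the authors computed. As an illustration only, the sketch below computes two textbook CTT quantities per question: difficulty (the share of systems answering correctly) and discrimination (the point-biserial correlation between a question's score and the rest-of-test score). The function name `item_statistics` and the 0/1 response-matrix layout are assumptions made for this example and do not come from the paper.

```python
from statistics import mean, pstdev


def item_statistics(responses):
    """Return (difficulty, discrimination) for each item.

    responses[s][i] is 1 if system s answered item i correctly, else 0.
    This layout is a hypothetical choice for the sketch.
    """
    n_items = len(responses[0])
    stats = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        # Rest score: total correct on all other items, so the item
        # is not correlated with itself.
        rest = [sum(row) - row[i] for row in responses]
        difficulty = mean(item)
        sd_item, sd_rest = pstdev(item), pstdev(rest)
        if sd_item == 0 or sd_rest == 0:
            # No variance means the item cannot separate systems.
            discrimination = 0.0
        else:
            cov = mean(a * b for a, b in zip(item, rest)) - mean(item) * mean(rest)
            discrimination = cov / (sd_item * sd_rest)
        stats.append((difficulty, discrimination))
    return stats
```

Under this reading, a question that every system answers correctly (or that none does) has zero discrimination, so a benchmark aiming to separate ability levels would favor questions with mid-range difficulty and high positive discrimination.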
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
