MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents

Tomer Wolfson; Harsh Trivedi; Mor Geva; Yoav Goldberg; Dan Roth; Tushar Khot; Ashish Sabharwal; Reut Tsarfaty

arXiv:2508.11133 · cs.CL · September 4, 2025

TL;DR

MoNaCo introduces a large benchmark of 1,315 natural, time-consuming questions requiring extensive reasoning across dozens of documents, exposing current LLM limitations in handling complex, real-world information-seeking tasks.

Contribution

The paper presents MoNaCo, a benchmark of real-world, complex questions for evaluating LLMs, together with a decomposed annotation pipeline for eliciting and manually answering such questions at scale.

Findings

01

Frontier LLMs achieve at most 61.2% F1 on MoNaCo.

02

LLM performance on these complex questions is hampered by low recall and hallucinations.

03

The results underscore the limitations of LLM-powered agents in handling the complexity and breadth of real-world information-seeking tasks.

Abstract

Automated agents, powered by large language models (LLMs), are emerging as the go-to tool for querying information. However, evaluation benchmarks for LLM agents rarely feature natural questions that are both information-seeking and genuinely time-consuming for humans. To address this gap, we introduce MoNaCo, a benchmark of 1,315 natural and time-consuming questions that require dozens, and at times hundreds, of intermediate steps to solve -- far more than any existing QA benchmark. To build MoNaCo, we developed a decomposed annotation pipeline to elicit and manually answer real-world time-consuming questions at scale. Frontier LLMs evaluated on MoNaCo achieve at most 61.2% F1, hampered by low recall and hallucinations. Our results underscore the limitations of LLM-powered agents in handling the complexity and sheer breadth of real-world information-seeking tasks -- with MoNaCo…
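To make the F1, recall, and hallucination terms in the abstract concrete, here is a minimal sketch of a set-based score for list-valued answers. It is an illustration only: the function names `normalize` and `set_prf1` are hypothetical, the exact-match normalization is an assumption, and the paper's actual scoring procedure may differ.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def set_prf1(predicted, gold):
    """Return (precision, recall, F1) between two lists of answer strings.

    Assumption: answers are compared as sets after simple normalization.
    This is a generic set-overlap metric, not necessarily MoNaCo's scorer.
    """
    pred = {normalize(a) for a in predicted}
    ref = {normalize(a) for a in gold}
    if not pred and not ref:
        # Both empty: treat as a perfect match.
        return 1.0, 1.0, 1.0
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Toy example (made-up data): one hallucinated answer, two missed answers.
gold = ["France", "Italy", "Spain", "Portugal"]
predicted = ["france", "Italy", "Atlantis"]
precision, recall, f1 = set_prf1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case the hallucinated answer lowers precision while the two missing answers lower recall, so a question with dozens of gold answers can score poorly on F1 even when most returned items are correct.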

Code & Models

Datasets

allenai/MoNaCo_Benchmark
dataset · 376 dl