A Benchmark for Deep Information Synthesis

Debjit Paul; Daniel Murphy; Milan Gritta; Ronald Cardenas; Victor Prokhorov; Lena Sophia Bolliger; Aysim Toker; Roy Miles; Andreea-Maria Oncescu; Jasivan Alex Sivakumar; Philipp Borchert; Ismail Elezi; Meiru Zhang; Ka Yiu Lee; Guchun Zhang; Jun Wang; Gerasimos Lampouras

arXiv:2602.21143·cs.AI·February 25, 2026

Open Access · 3 Reviews

TL;DR

DEEPSYNTH is a new benchmark designed to evaluate large language models on complex, real-world information synthesis tasks across multiple domains, revealing current limitations in reasoning and hallucination management.

Contribution

The paper introduces DEEPSYNTH, a comprehensive benchmark with 120 tasks across diverse domains, specifically crafted to assess LLMs' abilities in information synthesis and reasoning.

Findings

01. State-of-the-art LLMs achieve low scores, indicating high difficulty.

02. Current agents struggle with hallucinations and reasoning over large data.

03. DEEPSYNTH highlights the need for improved reasoning capabilities in LLMs.

Abstract

Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval. To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights. DEEPSYNTH contains 120 tasks collected across 7 domains and data sources covering 67 countries. DEEPSYNTH is constructed using a multi-stage data collection pipeline that requires annotators to collect official data sources, create hypotheses, perform manual analysis, and design tasks with verifiable…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The construction process of DeepSynth is rigorous. It consists of five steps: (1) data source identification, (2) hypothesis generation, (3) hypothesis validation, (4) task formulation, and (5) data validation. All steps are done by expert annotators. The demographic information of the annotators and the average time consumption are also reported in this paper.
- DeepSynth is challenging for many state-of-the-art models and systems, showing a huge gap between human capabilities and current models.
- Th

Weaknesses

- I understand the benchmark is challenging. However, the evaluation metrics are hard to interpret (especially EM and LLM Judge Score; see the questions section). The EM values are almost all 0, and the LLM Judge Score might favor the judge's own model family (GPT-4.1).
- (minor) The annotator demographic might also be biased (75% male, 81.25% PhD).
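The EM concern above can be illustrated with a minimal sketch. It assumes SQuAD-style answer normalization and token-level F1; the benchmark's actual metric definitions are not given in this excerpt, so the function names (`normalize`, `exact_match`, `token_f1`) and the example answers are hypothetical. The point is only that an answer with one wrong value scores 0 on exact match while still earning partial credit under F1.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction: str, reference: str) -> int:
    """Return 1 only if the normalized token sequences are identical."""
    return int(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall (bag of tokens)."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical multi-part answer: one of the two values is wrong.
reference = "Germany 12, France 9"
prediction = "Germany 12, France 8"

em_score = exact_match(prediction, reference)
f1_score = token_f1(prediction, reference)
print(em_score, f1_score)
```

Under these assumptions, a single wrong value in a multi-part answer is enough to zero out exact match, which is one plausible reason EM alone gives little signal on tasks with long structured answers.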

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- Multi-stage pipeline ensures non-memorizable, diverse tasks.
- Covers 9 models/agents with metrics like F1/EM and ablations.
- Breaks down performance by steps/operations.

Weaknesses

- Lacks full code/prompts; data promised post-acceptance.
- Annotator selection biased; uneven regional coverage.
- Overlaps with prior benchmarks without direct comparisons.
- Low baselines may reflect poor prompting.

Reviewer 03 · Rating 8 · Confidence 3

Strengths

- Excellent observation regarding the lack of real-world tasks that require synthesizing information from multiple sources. The motivation is strong: such a benchmark is well suited to advancing current LLM and agent systems, which are becoming increasingly powerful and saturating existing benchmarks.
- The dataset collection and curation process is exceptionally well designed, executed, and presented through a rigorous 4-stage pipeline (data source identification, hypothesis generat

Weaknesses

- Overly difficult questions might limit the practical applicability of the benchmark, making it hard to evaluate less powerful models. For example, if less powerful models all achieve nearly zero performance, it becomes difficult to obtain a meaningful signal for telling which model performs better. Benchmarks with graduated difficulty levels generally suffer less from this limitation.
- Since the benchmark requires gathering information from the

Taxonomy

Topics: Machine Learning in Materials Science · Topic Modeling · Artificial Intelligence in Healthcare and Education