DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing

Hongzhi Zhang; Yuanze Hu; Tinghai Zhang; Jia Fu; Tao Wang; Junwei Jing; Zhaoxin Fan; Qi Wang; Ruiming Tang; Han Li; Guorui Zhou; Kun Gai

arXiv:2601.03540·cs.CL·January 8, 2026

Open Access

TL;DR

DeepSynth-Eval introduces an objective benchmark for evaluating how well large language models synthesize and organize information in long-form reports, addressing a key challenge in autonomous research agents.

Contribution

It presents a novel, fine-grained evaluation protocol and benchmark for assessing information consolidation in LLMs, using high-quality survey papers as gold standards.

Findings

01 Multi-reference synthesis remains challenging for LLMs.

02 Plan-and-write workflows outperform single-turn generation.

03 Structured evaluation reduces hallucinations and improves coherence.

Abstract

The evolution of Large Language Models (LLMs) towards autonomous agents has catalyzed progress in Deep Research. While retrieval capabilities are well-benchmarked, the post-retrieval synthesis stage--where agents must digest massive amounts of context and consolidate fragmented evidence into coherent, long-form reports--remains under-evaluated due to the subjectivity of open-ended writing. To bridge this gap, we introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We leverage high-quality survey papers as gold standards, reverse-engineering research requests and constructing "Oracle Contexts" from their bibliographies to isolate synthesis from retrieval noise. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization), transforming…
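
The checklist-based protocol described in the abstract can be illustrated with a minimal sketch. The abstract is truncated here and does not give the scoring formula, so everything below is an assumption for illustration: the `ChecklistItem` type, the `pass_rates` helper, the example questions, and the choice of a simple per-category pass rate are all hypothetical, and the `passed` verdicts are assumed to come from some external judge that is not modelled.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    kind: str       # "general" (factual coverage) or "constraint" (structural organization)
    question: str   # hypothetical yes/no check applied to the generated report
    passed: bool    # verdict from an external judge (assumed; not specified in the abstract)


def pass_rates(items):
    """Return the fraction of passed items for each checklist kind."""
    totals, hits = {}, {}
    for item in items:
        totals[item.kind] = totals.get(item.kind, 0) + 1
        hits[item.kind] = hits.get(item.kind, 0) + int(item.passed)
    return {kind: hits[kind] / totals[kind] for kind in totals}


# Made-up example verdicts for a single generated report.
items = [
    ChecklistItem("general", "Does the report cover method A from the references?", True),
    ChecklistItem("general", "Does the report state the limitation noted for method B?", False),
    ChecklistItem("constraint", "Are the methods grouped by the requested categories?", True),
]

scores = pass_rates(items)
print(scores)
```

Reporting the two categories separately, rather than as one blended number, keeps factual coverage and structural organization distinguishable, which matches the distinction the abstract draws between the two checklist types.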

Taxonomy

Topics: Machine Learning in Materials Science · Topic Modeling · Scientific Computing and Data Management