Beyond Single-shot Writing: Deep Research Agents are Unreliable at Multi-turn Report Revision
Bingsen Chen, Boyan Li, Ping Nie, Yuyu Zhang, Xi Ye, Chen Zhao

TL;DR
This paper evaluates how well Deep Research Agents (DRAs) revise reports over multiple turns and finds that they are unreliable at preserving content and quality across iterative revisions, a critical limitation for research report generation.
Contribution
Introduces Mr Dre, a comprehensive evaluation suite for multi-turn report revision in DRAs, which exposes their current limitations and the difficulty of improving iterative report editing.
Findings
DRAs often regress on previously covered content and citations.
Agents disrupt content outside feedback scope during revisions.
Prompt engineering alone does not resolve revision issues.
Abstract
Existing benchmarks for Deep Research Agents (DRAs) treat report generation as a single-shot writing task, which fundamentally diverges from how human researchers iteratively draft and revise reports via self-reflection or peer feedback. Whether DRAs can reliably revise reports with user feedback remains unexplored. We introduce Mr Dre, an evaluation suite that establishes multi-turn report revision as a new evaluation axis for DRAs. Mr Dre consists of (1) a unified long-form report evaluation protocol spanning comprehensiveness, factuality, and presentation, and (2) a human-verified feedback simulation pipeline for multi-turn revision. Our analysis of five diverse DRAs reveals a critical limitation: while agents can address most user feedback, they also regress on 16-27% of previously covered content and citation quality. Over multiple revision turns, even the best-performing agents…
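One way to picture the regression finding is as set arithmetic over the content items a report covers before and after a revision turn. The sketch below is illustrative only and is not the paper's protocol: the function names, the idea of representing coverage as a set of item labels, and the sample data are all assumptions made for this example.

```python
from typing import Set


def regression_rate(covered_before: Set[str], covered_after: Set[str]) -> float:
    """Fraction of previously covered items that are missing after revision.

    Hypothetical metric for illustration; not the definition used by Mr Dre.
    """
    if not covered_before:
        return 0.0
    lost = covered_before - covered_after
    return len(lost) / len(covered_before)


def feedback_addressed_rate(requested: Set[str], covered_after: Set[str]) -> float:
    """Fraction of items requested in user feedback that the revision now covers."""
    if not requested:
        return 1.0
    return len(requested & covered_after) / len(requested)


# Made-up example data: item labels covered by a draft and by its revision.
before = {"background", "method", "limitations", "cost analysis", "related work"}
requested = {"ethics", "future work"}
after = {"background", "method", "related work", "cost analysis", "ethics", "future work"}

print(regression_rate(before, after))             # 0.2
print(feedback_addressed_rate(requested, after))  # 1.0
```

In this toy case the revision addresses all of the feedback yet drops one of five previously covered items, which mirrors the pattern the abstract describes: feedback is mostly addressed while earlier content regresses.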
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Mental Health via Writing
