Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation

Fangda Ye; Zhifei Xie; Yuxin Hu; Yihang Yin; Shurui Huang; Shikai Dong; Jianzhu Bao; Shuicheng Yan

arXiv:2604.10741 · cs.CL · April 21, 2026

TL;DR

Deep-Reporter introduces a novel framework for grounded multimodal long-form generation, integrating multimodal search, incremental synthesis, and context management to improve factual accuracy and coherence.

Contribution

The paper presents a unified agentic framework and a new benchmark for multimodal long-form generation, addressing the challenge of integrating diverse textual and visual evidence.

Findings

01. Effective multimodal retrieval and filtering demonstrated

02. Incremental synthesis improves coherence and citation accuracy

03. Post-training enhances multimodal integration performance

Abstract

Recent agentic search frameworks enable deep research via iterative planning and retrieval, reducing hallucinations and enhancing factual grounding. However, they remain text-centric, overlooking the multimodal evidence that characterizes real-world expert reports. We introduce a pressing task: multimodal long-form generation. Accordingly, we propose Deep-Reporter, a unified agentic framework for grounded multimodal long-form generation. It orchestrates: (i) Agentic Multimodal Search and Filtering to retrieve and filter textual passages and information-dense visuals; (ii) Checklist-Guided Incremental Synthesis to ensure coherent image-text integration and optimal citation placement; and (iii) Recurrent Context Management to balance long-range coherence with local fluency. We develop a rigorous curation pipeline producing 8K high-quality agentic traces for model optimization. We further…
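
The three stages named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Evidence`, `search_and_filter`, `synthesize_section`, `update_context`, `write_report`), the keyword match, the score threshold, and the bounded list used as context are assumptions chosen only to show how retrieval with filtering, checklist-driven section writing, and a rolling context might fit together.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    source_id: str  # hypothetical citation key
    modality: str   # "text" or "image"
    content: str    # passage text, or a caption standing in for an image
    score: float    # hypothetical relevance score in [0, 1]


def search_and_filter(pool, query, min_score=0.5):
    """Stage (i), simplified: keyword match, then drop low-scoring evidence."""
    hits = [e for e in pool if query.lower() in e.content.lower()]
    return [e for e in hits if e.score >= min_score]


def synthesize_section(item, evidence, context):
    """Stage (ii), simplified: one section per checklist item, citing each source."""
    cited = []
    for e in evidence:
        if e.modality == "image":
            cited.append(f"[Figure: {e.content}] [{e.source_id}]")
        else:
            cited.append(f"{e.content} [{e.source_id}]")
    # The rolling context is surfaced as a short lead-in to the new section.
    lead = f"(continuing from: {', '.join(context)}) " if context else ""
    return f"{item}: {lead}" + " ".join(cited)


def update_context(context, item, max_items=3):
    """Stage (iii), simplified: keep only the most recently covered items."""
    return (context + [item])[-max_items:]


def write_report(pool, checklist):
    """Run the three toy stages once per checklist item."""
    context, sections = [], []
    for item in checklist:
        evidence = search_and_filter(pool, item)
        sections.append(synthesize_section(item, evidence, context))
        context = update_context(context, item)
    return sections, context


# Made-up evidence pool, for illustration only.
pool = [
    Evidence("S1", "text", "Solar capacity grew in 2023.", 0.9),
    Evidence("S2", "image", "Chart of solar capacity by region", 0.8),
    Evidence("S3", "text", "Solar panels are sometimes blue.", 0.2),
    Evidence("S4", "text", "Wind output varies by season.", 0.7),
]
sections, context = write_report(pool, ["solar", "wind"])
for section in sections:
    print(section)
```

In this sketch the low-scoring passage `S3` is filtered out before synthesis, and the second section receives a short reminder of what the first one covered. A real system would replace each stub with model calls and a learned or summarized context.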
