DocRefine: An Intelligent Framework for Scientific Document Understanding and Content Optimization based on Multimodal Large Model Agents
Kun Qian, Wenjie Li, Tianyu Sun, Wenhong Wang, Wenhan Luo

TL;DR
DocRefine is a multimodal framework that orchestrates a multi-agent system built on large vision-language models to understand, edit, and summarize scientific PDFs from natural language instructions.
Contribution
It introduces a system of six specialized, collaborative agents that use large vision-language models (LVLMs) for scientific document understanding and content optimization, and it outperforms state-of-the-art baselines on the DocEditBench dataset.
Findings
Achieves high semantic consistency and layout fidelity scores.
Outperforms state-of-the-art baselines on the DocEditBench dataset.
Demonstrates effective handling of complex multimodal document editing.
Abstract
The exponential growth of scientific literature in PDF format necessitates advanced tools for efficient and accurate document understanding, summarization, and content optimization. Traditional methods fall short in handling complex layouts and multimodal content, while direct application of Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) lacks precision and control for intricate editing tasks. This paper introduces DocRefine, an innovative framework designed for intelligent understanding, content refinement, and automated summarization of scientific PDF documents, driven by natural language instructions. DocRefine leverages the power of advanced LVLMs (e.g., GPT-4o) by orchestrating a sophisticated multi-agent system comprising six specialized and collaborative agents: Layout & Structure Analysis, Multimodal Content Understanding, Instruction Decomposition,…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Text Analysis Techniques
