DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation
Tingwen Zhang, Ling Yue, Zhen Xu, Shaowu Pan

TL;DR
DiagramBank is a large-scale, curated dataset of scientific schematic diagrams with rich metadata, enabling retrieval and exemplar-driven generation of publication-quality figures to enhance AI-assisted scientific writing.
Contribution
The paper introduces DiagramBank, a comprehensive dataset of scientific diagrams with metadata, and provides tools for retrieval and figure synthesis to support AI-assisted publication creation.
Findings
The dataset contains 89,422 schematic diagrams curated from top-tier scientific publications.
An automated curation pipeline effectively extracts and filters diagrams.
The retrieval-augmented generation code demonstrates exemplar-driven figure synthesis.
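As a rough illustration of what exemplar-driven retrieval can look like, the sketch below ranks a few diagram records against a text query and assembles the top matches into a prompt. Everything in it is an assumption made for illustration: the record fields (`id`, `caption`), the sample captions, the bag-of-words cosine scoring, and the `retrieve` and `build_prompt` helpers are hypothetical and are not DiagramBank's schema, retrieval model, or released code.

```python
import math
import re
from collections import Counter

# Hypothetical exemplar records; field names and captions are illustrative only.
EXEMPLARS = [
    {"id": "d1", "caption": "Overview of a retrieval-augmented generation pipeline"},
    {"id": "d2", "caption": "Architecture of a transformer encoder with attention"},
    {"id": "d3", "caption": "Workflow for automated dataset curation and filtering"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, records, k=2):
    """Return ids of up to k records whose captions overlap with the query."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine(query_vec, Counter(tokenize(r["caption"]))), r["id"])
        for r in records
    ]
    # Highest score first; records with no overlap are dropped.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rid for score, rid in scored[:k] if score > 0]


def build_prompt(query, records, k=2):
    """Assemble a text prompt that lists the retrieved exemplar captions."""
    by_id = {r["id"]: r for r in records}
    lines = [f"Target diagram: {query}", "Exemplars:"]
    for rid in retrieve(query, records, k):
        lines.append(f"- [{rid}] {by_id[rid]['caption']}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("retrieval augmented generation pipeline", EXEMPLARS))
```

A real system would likely replace the bag-of-words scoring with learned multimodal embeddings over the diagrams and their paper metadata; the toy scoring here only keeps the example self-contained.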
Abstract
Recent advances in autonomous "AI scientist" systems have demonstrated the ability to automatically write scientific manuscripts and to write and execute code. However, producing a publication-grade scientific diagram (e.g., a teaser figure) is still a major bottleneck in the "end-to-end" paper generation process. For example, a teaser figure acts as a strategic visual interface and serves a different purpose than derivative data plots. It demands conceptual synthesis and planning to translate a complex logical workflow into a compelling graphic that guides intuition and sparks curiosity. Existing AI scientist systems usually omit this component or fall back to an inferior alternative. To bridge this gap, we present DiagramBank, a large-scale dataset consisting of 89,422 schematic diagrams curated from existing top-tier scientific publications, designed for multimodal retrieval and…
