OmniScience: A Large-scale Multi-modal Dataset for Scientific Image Understanding
Haoyi Tao, Chaozheng Huang, Nan Wang, Han Lyu, Linfeng Zhang, Guolin Ke, Xi Fang

TL;DR
OmniScience introduces a large, high-quality multi-modal dataset of scientific images with dense descriptions, enabling improved training and evaluation of models in scientific image understanding across multiple disciplines.
Contribution
The paper presents OmniScience, a comprehensive dataset with a novel re-captioning pipeline and quality filtering, enhancing scientific image understanding for large multimodal models.
Findings
Significant improvement in multi-modal similarity scores.
Enhanced model performance on scientific image understanding benchmarks.
High-fidelity, densely annotated dataset covering multiple scientific disciplines.
Abstract
Multimodal Large Language Models (MLLMs) demonstrate strong performance on natural image understanding, yet exhibit limited capability in interpreting scientific images, including but not limited to schematic diagrams, experimental characterizations, and analytical charts. This limitation is particularly pronounced in open-source MLLMs. The gap largely stems from existing datasets with limited domain coverage, coarse structural annotations, and weak semantic grounding. We introduce OmniScience, a large-scale, high-fidelity multi-modal dataset comprising 1.5 million figure-caption-context triplets, spanning more than 10 major scientific disciplines. To obtain image caption data with higher information density and accuracy for multi-modal large-model training, we develop a dynamic model-routing re-captioning pipeline that leverages state-of-the-art multi-modal large language models to generate…
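To make the "figure-caption-context triplet" unit concrete, here is a minimal Python sketch of how one such record might be represented and sanity-checked. It is an illustration only: the class name, field names, the `discipline` label, and the completeness check are all assumptions, not the dataset's actual schema or the paper's filtering procedure.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FigureTriplet:
    """One figure-caption-context record (all field names are hypothetical)."""
    image_path: str   # where the figure image is stored
    caption: str      # caption text attached to the figure
    context: str      # surrounding text that discusses the figure
    discipline: str   # illustrative discipline label, e.g. "materials"


def is_complete(t: FigureTriplet) -> bool:
    """Return True if all three parts of the triplet are non-empty.

    This is a placeholder check, not the paper's quality filter.
    """
    return all(s.strip() for s in (t.image_path, t.caption, t.context))


def count_by_discipline(records) -> dict:
    """Count records per discipline label."""
    return dict(Counter(r.discipline for r in records))


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        FigureTriplet("fig1.png", "XRD pattern of the sample.",
                      "Figure 1 shows the diffraction peaks.", "materials"),
        FigureTriplet("fig2.png", "", "Figure 2 is discussed here.", "biology"),
    ]
    kept = [r for r in sample if is_complete(r)]
    print(count_by_discipline(kept))
```

Keeping the context alongside the caption in a single record means any downstream re-captioning or filtering step can see the text that discusses the figure, not just the figure itself.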
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
