OmniScience: A Large-scale Multi-modal Dataset for Scientific Image Understanding

Haoyi Tao; Chaozheng Huang; Nan Wang; Han Lyu; Linfeng Zhang; Guolin Ke; Xi Fang

arXiv:2602.13758 · cs.CV · February 17, 2026

TL;DR

OmniScience introduces a large, high-quality multi-modal dataset of scientific images with dense descriptions, enabling improved training and evaluation of models in scientific image understanding across multiple disciplines.

Contribution

The paper presents OmniScience, a dataset of 1.5 million figure-caption-context triplets built with a dynamic model-routing re-captioning pipeline and quality filtering, intended to improve scientific image understanding in multimodal large language models.

Findings

01 Significant improvement in multi-modal similarity scores.
02 Enhanced model performance on scientific image understanding benchmarks.
03 High-fidelity, densely annotated dataset covering multiple scientific disciplines.

Abstract

Multimodal Large Language Models (MLLMs) demonstrate strong performance on natural image understanding, yet exhibit limited capability in interpreting scientific images, including but not limited to schematic diagrams, experimental characterizations, and analytical charts. This limitation is particularly pronounced in open-source MLLMs. The gap largely stems from existing datasets with limited domain coverage, coarse structural annotations, and weak semantic grounding. We introduce OmniScience, a large-scale, high-fidelity multi-modal dataset comprising 1.5 million figure-caption-context triplets, spanning more than 10 major scientific disciplines. To obtain image caption data with higher information density and accuracy for multi-modal large-model training, we develop a dynamic model-routing re-captioning pipeline that leverages state-of-the-art multi-modal large language models to generate…

Code & Models

Datasets

UniParser/OmniScience
dataset · 6.1k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling