S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

He Wang; Longteng Guo; Pengkang Huo; Xuanxu Lin; Yichen Yuan; Jie Jiang; Jing Liu

arXiv:2601.00264 · cs.CV · May 7, 2026

TL;DR

S1-MMAlign is a large-scale, multi-disciplinary scientific dataset of over 15.5 million image-text pairs drawn from 2.5 million open-access papers, with captions rewritten by a multimodal-LLM recaptioning pipeline to strengthen figure-text alignment.

Contribution

The paper introduces S1-MMAlign together with a semantic enhancement pipeline that uses multimodal large language models to recaption figures, drawing on paper abstracts and citation contexts, for tighter scientific figure-text alignment.

Findings

01 Enhanced data quality, confirmed by reduced SciBERT pseudo-perplexity.

02 Improved CLIP image-text alignment.

03 Improved performance on zero-shot scientific captioning and reasoning tasks.
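
The first two findings rest on two standard metrics, and a small sketch may make them concrete. It assumes the usual definitions: pseudo-perplexity as the exponential of the negative mean per-token log-probability obtained by masking each token in turn, and alignment as the cosine similarity between an image embedding and a text embedding. The function names and toy inputs are hypothetical, and the paper's exact evaluation protocol is not given on this page.

```python
import math


def pseudo_perplexity(token_log_probs):
    """Pseudo-perplexity from per-token masked log-probabilities.

    Each entry is assumed to be log P(token | rest of caption), as a
    masked language model such as SciBERT would score it when that
    token is masked out. Lower values mean more fluent, in-domain text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_log_prob = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-mean_log_prob)


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors of equal length.

    With CLIP-style encoders, u would be an image embedding and v a
    text embedding; higher values indicate closer alignment.
    """
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)
```

In practice the log-probabilities and embeddings would come from the respective models; the functions above only show how the scores are aggregated once those values are available.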

Abstract

Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that leverages advanced multimodal large language models to recaption images, by synthesizing comprehensive context from paper abstracts and the citation contexts of corresponding…
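
As a rough illustration of the context synthesis the abstract describes, the sketch below gathers a figure's raw caption, the paper abstract, and a few citing passages into one text block that could accompany the image when a multimodal model is asked to recaption it. Everything here is an assumption for illustration: the `FigureRecord` fields, the `build_recaption_context` helper, the labels, and the cap on citing passages are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class FigureRecord:
    """Hypothetical container for the text that surrounds one figure."""
    raw_caption: str
    paper_abstract: str
    citation_contexts: list = field(default_factory=list)


def build_recaption_context(record, max_contexts=3):
    """Join abstract, raw caption, and citing passages into one string.

    max_contexts is an illustrative limit on how many citing passages
    are kept; it is not a documented parameter of the dataset pipeline.
    """
    parts = [
        f"Abstract: {record.paper_abstract.strip()}",
        f"Original caption: {record.raw_caption.strip()}",
    ]
    for i, ctx in enumerate(record.citation_contexts[:max_contexts], start=1):
        parts.append(f"Citing passage {i}: {ctx.strip()}")
    return "\n".join(parts)
```

The resulting string would be passed alongside the figure image to whichever model performs the recaptioning; that model call is left out because the page does not specify it.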

Code & Models

Repositories

https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign
