SciMDR: Advancing Scientific Multimodal Document Reasoning

Ziyu Chen; Yilun Zhao; Chengye Wang; Rilyn Han; Manasi Patwardhan; Arman Cohan

arXiv:2603.12249·cs.CL·April 30, 2026

TL;DR

This paper introduces SciMDR, a large-scale scientific multimodal dataset built with a novel synthesize-and-reground framework, to improve foundation models' ability to reason over full scientific documents.

Contribution

The paper presents the synthesize-and-reground framework for generating scientific multimodal datasets and releases SciMDR, a large-scale training dataset of 300K QA pairs with explicit reasoning chains across 20K scientific papers.

Findings

01

Models trained on SciMDR show significant performance improvements.

02

The framework effectively balances scale, faithfulness, and realism in dataset creation.

03

SciMDR-Eval provides an expert-annotated benchmark for multimodal comprehension within full-length scientific workflows.

Abstract

Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models…
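The two-stage pipeline described in the abstract can be pictured with a small Python sketch. The abstract does not say how either stage is implemented, so everything here is an assumption made for illustration: the names `Segment`, `QAPair`, `DocumentTask`, `synthesize_qa`, and `reground` are hypothetical, the identifier scheme (`"fig2"`, `"sec1"`) is invented, and Stage 1 uses a fixed template as a stand-in for whatever generator the authors actually use.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A focused slice of a paper: one claim plus the element it cites."""
    claim: str
    evidence_id: str  # hypothetical identifier, e.g. "fig2"


@dataclass
class QAPair:
    """An isolated QA pair with its reasoning, tied to one evidence element."""
    question: str
    answer: str
    reasoning: str
    evidence_id: str


@dataclass
class DocumentTask:
    """The same QA pair, now posed against the whole document."""
    question: str
    answer: str
    reasoning: str
    context_ids: list    # every element of the full document, in order
    evidence_index: int  # position of the evidence inside that context


def synthesize_qa(segment):
    """Stage 1 stand-in: a fixed template, NOT the paper's generator."""
    return QAPair(
        question=f"What does {segment.evidence_id} show about the claim?",
        answer=segment.claim,
        reasoning=f"Inspect {segment.evidence_id}; it supports: {segment.claim}",
        evidence_id=segment.evidence_id,
    )


def reground(pair, document_ids):
    """Stage 2 stand-in: attach the isolated pair to the full document."""
    if pair.evidence_id not in document_ids:
        raise ValueError(f"{pair.evidence_id} not found in document")
    return DocumentTask(
        question=pair.question,
        answer=pair.answer,
        reasoning=pair.reasoning,
        context_ids=list(document_ids),
        evidence_index=document_ids.index(pair.evidence_id),
    )


# Toy document: an ordered list of section, figure, and table identifiers.
doc = ["sec1", "fig1", "sec2", "fig2", "tab1"]
seg = Segment(claim="Accuracy rises with model size.", evidence_id="fig2")
task = reground(synthesize_qa(seg), doc)
print(task.evidence_index, len(task.context_ids))
```

The point of the split, under these assumptions, is that the answer and reasoning are fixed while the context is still small and easy to verify, and only afterwards is the context widened to the full document, so the task gets harder without the label changing.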
