SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature

Yiming Ren; Junjie Wang; Yuxin Meng; Yihang Shi; Zhiqiang Lin; Ruihang Chu; Yiran Xu; Ziming Li; Yunfei Zhao; Zihan Wang; Yu Qiao; Ruiming Tang; Minghao Liu; Yujiu Yang

arXiv:2601.10108 · cs.CL · January 16, 2026

TL;DR

This paper introduces SIN-Bench, a benchmark that evaluates whether multimodal large language models can construct explicit evidence chains across the interleaved text and figures of long scientific papers, with an emphasis on verifiable reasoning.

Contribution

The paper proposes the FITO paradigm and the SIN-Data corpus, along with four progressive tasks (SIN-Find, SIN-Verify, SIN-QA, SIN-Summary) and the "No Evidence, No Score" scoring method for assessing evidence-linked understanding of scientific documents.

Findings

01. Grounding is the main bottleneck in model performance.

02. Gemini-3-pro achieves the highest overall score (0.573).

03. GPT-5 has the highest answer accuracy (0.767) but lower evidence alignment.

Abstract

Evaluating whether multimodal large language models truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic "Needle-In-A-Haystack" tests often reward answer matching without requiring a causal, evidence-linked reasoning trace in the document. We propose the "Fish-in-the-Ocean" (FITO) paradigm, which requires models to construct explicit cross-modal evidence chains within native scientific documents. To operationalize FITO, we build SIN-Data, a scientific interleaved corpus that preserves the native interleaving of text and figures. On top of it, we construct SIN-Bench with four progressive tasks covering evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA), and evidence-anchored synthesis (SIN-Summary). We further introduce "No Evidence, No Score", scoring predictions when grounded to verifiable anchors and…
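The "No Evidence, No Score" idea can be illustrated with a small sketch. This is not the paper's scoring formula, which the truncated abstract does not give. It is a minimal example assuming that a prediction earns credit only when at least one of its cited anchors matches a verifiable gold anchor. The names `Prediction` and `gated_score`, the exact-match answer check, and the precision weighting are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    answer: str
    # Hypothetical anchor identifiers (e.g. a figure or section label)
    cited_anchors: frozenset


def gated_score(pred: Prediction, gold_answer: str, gold_anchors: frozenset) -> float:
    """Illustrative evidence-gated score (an assumption, not the paper's metric).

    A prediction with no verifiable anchor scores 0.0 even if its answer
    matches. Otherwise the answer credit is scaled by anchor precision.
    """
    verified = pred.cited_anchors & gold_anchors
    if not verified:
        return 0.0  # no evidence, no score
    matches = pred.answer.strip().lower() == gold_answer.strip().lower()
    answer_credit = 1.0 if matches else 0.0
    anchor_precision = len(verified) / len(pred.cited_anchors)
    return answer_credit * anchor_precision


gold = frozenset({"fig3", "tab1"})
grounded = Prediction("42", frozenset({"fig3", "sec2.1"}))
ungrounded = Prediction("42", frozenset())

print(gated_score(grounded, "42", gold))
print(gated_score(ungrounded, "42", gold))
```

Under this toy rule, a model can give the right answer and still score zero when its cited evidence cannot be verified. That is one way answer accuracy and evidence alignment can come apart.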

Code & Models

Datasets

IIGroup/SIN-Bench (dataset, 2.0k downloads)

Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Computational and Text Analysis Methods