SciIF: Benchmarking Scientific Instruction Following Towards Rigorous Scientific Intelligence

Encheng Su; Jianyu Wu; Chen Tang; Lintao Wang; Pengze Li; Aoran Wang; Jinouwen Zhang; Yizhou Wang; Yuan Meng; Xinzhu Ma; Shixiang Tang; Houqiang Li

arXiv:2601.04770·cs.AI·January 13, 2026

Open Access

TL;DR

SciIF introduces a comprehensive benchmark to evaluate large language models on their ability to follow scientific instructions rigorously, emphasizing constraint adherence, evidence provision, and reasoning transparency across multiple scientific disciplines.

Contribution

The paper presents SciIF, a novel benchmark that assesses scientific instruction following with emphasis on constraint compliance and evidence, addressing limitations of existing evaluation standards.

Findings

01 Models can be evaluated on both answer correctness and constraint adherence.

02 SciIF enables diagnosis of reasoning failures in scientific problem-solving.

03 The benchmark promotes the development of more reliable scientific AI agents.

Abstract

As large language models (LLMs) transition from general knowledge retrieval to complex scientific discovery, their evaluation standards must also incorporate the rigorous norms of scientific inquiry. Existing benchmarks exhibit a critical blind spot: general instruction-following metrics focus on superficial formatting, while domain-specific scientific benchmarks assess only final-answer correctness, often rewarding models that arrive at the right result for the wrong reasons. To address this gap, we introduce scientific instruction following: the capability to solve problems while strictly adhering to the constraints that establish scientific validity. Specifically, we introduce SciIF, a multi-discipline benchmark that evaluates this capability by pairing university-level problems with a fixed catalog of constraints across three pillars: scientific conditions (e.g., boundary checks…
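As a rough illustration of scoring a response on both correctness and constraint adherence, the sketch below pairs one problem with a small list of constraints and reports the two signals separately. Everything in it is an assumption made for illustration: the names `Constraint` and `score_response`, the substring-based checks, and the `compliance` and `strict_pass` metrics are hypothetical and are not taken from the paper's actual evaluation protocol.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Constraint:
    """Hypothetical constraint: a name plus a check over the response text."""
    name: str
    check: Callable[[str], bool]  # True when the response satisfies the constraint


def score_response(
    response: str,
    final_answer: str,
    reference_answer: str,
    constraints: List[Constraint],
) -> Dict[str, object]:
    """Score correctness and constraint adherence as separate signals."""
    # Correctness: naive exact match on the final answer (an assumption).
    correct = final_answer.strip() == reference_answer.strip()
    # Adherence: run every constraint check against the full response.
    per_constraint = {c.name: c.check(response) for c in constraints}
    satisfied = sum(per_constraint.values())
    compliance = satisfied / len(per_constraint) if per_constraint else 1.0
    return {
        "correct": correct,
        "compliance": compliance,
        # A response passes strictly only if it is correct AND obeys every constraint.
        "strict_pass": correct and all(per_constraint.values()),
        "per_constraint": per_constraint,
    }


# Toy constraints using substring checks; a real checker would be far richer.
constraints = [
    Constraint("states_units", lambda r: "m/s" in r),
    Constraint("boundary_check", lambda r: "boundary" in r.lower()),
]

response = "v = d / t = 6.0 m / 2.0 s = 3.0 m/s"
report = score_response(response, "3.0 m/s", "3.0 m/s", constraints)
print(report)
```

In this toy case the final answer matches the reference, but the response never mentions a boundary check, so it is correct without passing strictly. That separation is the kind of right-answer, incomplete-justification case the abstract describes.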

Taxonomy

Topics: Topic Modeling · Machine Learning in Materials Science · Multimodal Machine Learning Applications