HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery

Yaping Zhang; Qixuan Zhang; Xingquan Zhang; Zhiyuan Chen; Wenwen Zhuang; Yupu Liang; Lu Xiang; Yang Zhao; Jiajun Zhang; Yu Zhou; and Chengqing Zong

arXiv:2512.22899 · cs.AI · December 30, 2025

TL;DR

HiSciBench is a comprehensive hierarchical benchmark for evaluating scientific intelligence in foundation models across multiple disciplines and stages of scientific reasoning, highlighting significant performance gaps and guiding future development.

Contribution

It introduces a multi-level, multi-disciplinary benchmark that assesses models on the full scientific workflow, from literacy to discovery, with integrated, dependency-aware evaluation.

Findings

01. Models perform well on basic literacy tasks (up to 69% accuracy).

02. Performance drops significantly on complex discovery tasks (down to 25%).

03. HiSciBench provides detailed diagnostics for scientific reasoning capabilities.

Abstract

The rapid advancement of large language models (LLMs) and multimodal foundation models has sparked growing interest in their potential for scientific research. However, scientific intelligence encompasses a broad spectrum of abilities ranging from understanding fundamental knowledge to conducting creative discovery, and existing benchmarks remain fragmented. Most focus on narrow tasks and fail to reflect the hierarchical and multi-disciplinary nature of real scientific inquiry. We introduce HiSciBench, a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow: Scientific Literacy (L1), Literature Parsing (L2), Literature-based Question Answering (L3), Literature Review Generation (L4), and Scientific Discovery (L5). HiSciBench contains 8,735 carefully curated…

Code & Models

Models

🤗
ScienceOne-AI/HiSciBench
model· ♡ 3

Taxonomy

Topics: Machine Learning in Materials Science · Topic Modeling · Artificial Intelligence in Healthcare and Education