ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning

Hongwei Liu; Junnan Liu; Shudong Liu; Haodong Duan; Yuqiang Li; Mao Su; Xiaohong Liu; Guangtao Zhai; Xinyu Fang; Qianhong Ma; Taolin Zhang; Zihan Ma; Yufeng Zhao; Peiheng Zhou; Linchen Xiao; Wenlong Zhang; Shijie Zhou; Xingjian Ma; Siqi Sun; Jiaye Ge; Meng Li; Yuhong Liu; Jianxin Dong; Jiaying Li; Hui Wu; Hanwen Liang; Jintai Lin; Yanting Wang; Jie Dong; Tong Zhu; Tianfan Fu; Conghui He; Qi Zhang; Songyang Zhang; Lei Bai; Kai Chen

arXiv:2511.14366 · cs.CL · November 21, 2025

TL;DR

ATLAS is a comprehensive, high-difficulty, multidisciplinary benchmark designed to evaluate advanced scientific reasoning in large language models, emphasizing originality, cross-disciplinary integration, and complex answer formats.

Contribution

The paper introduces a large-scale, rigorously curated evaluation suite of original questions across seven scientific fields, addressing the narrow disciplinary focus, oversimplified answer formats, and contamination risk of existing benchmarks.

Findings

01. Effective in differentiating model reasoning capabilities
02. Demonstrates the importance of cross-disciplinary evaluation
03. Highlights the challenge of complex scientific reasoning in LLMs

Abstract

The rapid advancement of Large Language Models (LLMs) has led to performance saturation on many established benchmarks, questioning their ability to distinguish frontier models. Concurrently, existing high-difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real-world scientific inquiry. To address these challenges, we introduce ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Developed by domain experts (PhD-level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination…

Taxonomy

Topics: Topic Modeling · Scientific Computing and Data Management · Machine Learning in Materials Science