FrontierScience: Evaluating AI's Ability to Perform Expert-Level Scientific Tasks

Miles Wang; Robi Lin; Kat Hu; Joy Jiao; Neil Chowdhury; Ethan Chang; Tejal Patwardhan

arXiv:2601.21165 · cs.AI · January 30, 2026

Open Access · 1 dataset

TL;DR

FrontierScience is a benchmark that evaluates AI models on expert-level scientific reasoning across physics, chemistry, and biology, organized into two tracks: Olympiad problems and research problems.

Contribution

The paper introduces a novel, multi-faceted evaluation framework that combines real-world, PhD-level scientific problems with a detailed rubric-based assessment method.

Findings

01. Models show significant progress but still struggle with complex scientific reasoning.

02. The benchmark reveals gaps in AI's ability to perform at expert scientific levels.

03. The benchmark provides a new standard for measuring scientific reasoning in AI.

Abstract

We introduce FrontierScience, a benchmark evaluating expert-level scientific reasoning in frontier language models. Recent model progress has nearly saturated existing science benchmarks, which often rely on multiple-choice knowledge questions or already published information. FrontierScience addresses this gap through two complementary tracks: (1) Olympiad, consisting of international olympiad problems at the level of IPhO, IChO, and IBO, and (2) Research, consisting of PhD-level, open-ended problems representative of sub-tasks in scientific research. FrontierScience contains several hundred questions (including 160 in the open-sourced gold set) covering subfields across physics, chemistry, and biology, from quantum electrodynamics to synthetic organic chemistry. All Olympiad problems are originally produced by international Olympiad medalists and national team coaches to ensure…

Code & Models

Datasets

Maxue627/SCILLM-benchmarks
dataset · 154 downloads

Taxonomy

Topics: Machine Learning in Materials Science · Artificial Intelligence in Healthcare and Education · Scientific Computing and Data Management