COMPOSITE-STEM

Kyle Waters; Lucas Nuzzi; Tadhg Looram; Alessandro Tomasiello; Ariel Ghislain Kemogne Kamdoum; Bikun Li; Damien Sileo; Egor Kretov; Francesco Fournier-Facio; Georgios Soloupis; Haile Kassahun; Hew Wolff; Jiaqi Cai; Lianghui Li; Marc Roth; Mohinder Naiya; Naixu Guo; Qicheng Tang; Richard Wheeler; Samuele Sala; Serguei Popov; Steven Dillmann; Yuqi Li

arXiv:2604.09836 · cs.AI · April 20, 2026

TL;DR

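COMPOSITE-STEM is a new benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathematics. It evaluates AI scientific reasoning on problems that current frontier models mostly fail, and it grades answers with flexible protocols rather than constrained outputs alone.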

Contribution

The paper introduces an expert-curated benchmark that combines exact-match grading and criterion-based rubrics with an LLM-as-a-jury protocol to assess AI scientific reasoning in physics, biology, chemistry, and mathematics.

Findings

1. Top model achieves 21% accuracy on COMPOSITE-STEM.

2. Benchmark captures capabilities beyond current frontier models.

3. All tasks are open-sourced for reproducibility and further research.

Abstract

AI agents hold growing promise for accelerating scientific discovery; yet, a lack of frontier evaluations hinders adoption into real workflows. Expert-written benchmarks have proven effective at measuring AI reasoning, but most at this stage have become saturated and only measure performance on constrained outputs. To help address this gap, we introduce COMPOSITE-STEM, a benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathematics, curated by doctoral-level researchers. Our benchmark combines exact-match grading and criterion-based rubrics with an LLM-as-a-jury grading protocol, allowing more flexible assessment of scientifically meaningful outputs. Using an adapted multimodal Terminus-2 agent harness within the Harbor agentic evaluation framework, we evaluate four frontier models. The top-performing model achieves 21%, demonstrating that COMPOSITE-STEM captures…
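To make the grading description concrete, here is a minimal sketch of how exact-match scoring and a rubric judged by a jury might fit together. It is not the authors' implementation. The function names, the whitespace and case normalization, the strict-majority vote, and the "fraction of criteria met" score are all assumptions chosen for illustration, and the jurors are deterministic stand-ins for what would be LLM calls.

```python
from typing import Callable, Sequence

# Hypothetical juror type: takes (criterion, answer) and returns True when the
# juror judges the criterion satisfied. A real juror would wrap an LLM call.
Juror = Callable[[str, str], bool]


def exact_match(answer: str, expected: str) -> float:
    """Return 1.0 if the strings agree after case/whitespace normalization."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())
    return 1.0 if normalize(answer) == normalize(expected) else 0.0


def jury_verdict(criterion: str, answer: str, jurors: Sequence[Juror]) -> bool:
    """Strict-majority vote across jurors (an assumed aggregation rule)."""
    if not jurors:
        raise ValueError("at least one juror is required")
    votes = sum(1 for juror in jurors if juror(criterion, answer))
    return 2 * votes > len(jurors)


def rubric_score(answer: str, criteria: Sequence[str], jurors: Sequence[Juror]) -> float:
    """Fraction of rubric criteria the jury accepts (an assumed scoring rule)."""
    if not criteria:
        raise ValueError("at least one criterion is required")
    met = sum(1 for c in criteria if jury_verdict(c, answer, jurors))
    return met / len(criteria)


# Deterministic stand-in jurors so the sketch runs without any model access.
def keyword_juror(criterion: str, answer: str) -> bool:
    return criterion.lower() in answer.lower()


def skeptical_juror(criterion: str, answer: str) -> bool:
    return False


jurors = [keyword_juror, keyword_juror, skeptical_juror]
answer = "Energy is conserved in this collision."
score = rubric_score(answer, ["energy", "momentum"], jurors)

print(exact_match("  6.02e23 ", "6.02E23"))  # 1.0
print(score)  # 0.5
```

In this toy run the jury accepts the "energy" criterion by two votes to one and rejects "momentum", so the answer earns half credit. This kind of partial score is something exact-match grading alone cannot express.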

Code & Models

Datasets

portex/COMPOSITE-STEM
dataset · 146 dl
