SUPERChem: A Multimodal Reasoning Benchmark in Chemistry

Zehua Zhao; Zhixian Huang; Junren Li; Siyu Lin; Junting Zhou; Fengqi Cao; Kun Zhou; Rui Ge; Tingting Long; Yuexiang Zhu; Yan Liu; Jie Zheng; Junnian Wei; Rong Zhu; Peng Zou; Wenyu Li; Zekai Cheng; Tian Ding; Yaxuan Wang; Yizhao Yan; Tingru Wei; Haowei Ming; Weijie Mao; Chen Sun; Yiming Liu; Zichen Wang; Zuo Zhang; Tong Yang; Hao Ma; Zhen Gao; and Jian Pei

arXiv:2512.01274 · cs.CL · December 2, 2025



TL;DR

SUPERChem is a comprehensive, multimodal benchmark with expert-curated chemistry problems designed to evaluate and advance the reasoning capabilities of large language models beyond simple accuracy metrics.

Contribution

The paper introduces a challenging new benchmark with process-level evaluation and a scoring method for reasoning quality, Reasoning Path Fidelity (RPF), addressing limitations of previous chemistry reasoning assessments.

Findings

01

GPT-5 (High), the best-performing model, reaches 38.5% accuracy, just below the human baseline of 40.3%.

02

Multimodal information influences model reasoning differently.

03

The benchmark distinguishes high-fidelity reasoners from heuristic approaches.

Abstract

Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring to evaluate reasoning quality beyond final-answer accuracy. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%, followed closely by Gemini 2.5 Pro (37.9%) and…


Taxonomy

Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Advanced Graph Neural Networks