QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry

Jiaqing Xie; Weida Wang; Ben Gao; Zhuo Yang; Haiyuan Wan; Shufei Zhang; Tianfan Fu; Yuqiang Li

arXiv:2508.01670 · cs.AI · November 5, 2025

TL;DR

QCBench is a comprehensive benchmark designed to evaluate large language models' ability to perform rigorous, step-by-step quantitative chemistry calculations across various subfields, revealing current limitations in scientific computation accuracy.

Contribution

This work introduces QCBench, the first extensive benchmark for assessing LLMs on domain-specific quantitative chemistry problems, enabling targeted diagnosis and future improvements.

Findings

01 Performance declines as task difficulty increases

02 Models struggle with explicit numerical reasoning in chemistry

03 Benchmark reveals gaps between language fluency and scientific accuracy

Abstract

Quantitative chemistry is central to modern chemical research, yet the ability of large language models (LLMs) to perform its rigorous, step-by-step calculations remains underexplored. To fill this gap, we propose QCBench, a Quantitative Chemistry oriented benchmark comprising 350 computational chemistry problems across 7 chemistry subfields: analytical chemistry, bio/organic chemistry, general chemistry, inorganic chemistry, physical chemistry, polymer chemistry, and quantum chemistry. To systematically evaluate the mathematical reasoning abilities of LLMs, the problems are categorized into three tiers: easy, medium, and difficult. Each problem, rooted in realistic chemical scenarios, is structured to prevent heuristic shortcuts and demand explicit numerical reasoning. QCBench enables fine-grained diagnosis of computational weaknesses, reveals…
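The abstract describes numerical answers grouped by subfield and difficulty tier. As a rough illustration of what a fine-grained breakdown could look like, the sketch below scores numeric answers against references and aggregates accuracy per group. It is not QCBench's actual evaluation code: the relative-tolerance check, the record fields (`subfield`, `tier`, `predicted`, `reference`), the function names, and the sample values are all assumptions made up for this example.

```python
import math
from collections import defaultdict


def is_correct(predicted, reference, rel_tol=0.01):
    """Assumed scoring rule: answer counts if within a relative tolerance."""
    return math.isclose(predicted, reference, rel_tol=rel_tol)


def accuracy_by_group(records, key):
    """Return accuracy per group, where key is e.g. 'subfield' or 'tier'."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        if is_correct(record["predicted"], record["reference"]):
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical records; the numbers are illustrative, not benchmark data.
records = [
    {"subfield": "analytical", "tier": "easy", "predicted": 0.1002, "reference": 0.1000},
    {"subfield": "analytical", "tier": "difficult", "predicted": 3.2, "reference": 4.7},
    {"subfield": "physical", "tier": "medium", "predicted": -285.6, "reference": -285.8},
    {"subfield": "quantum", "tier": "difficult", "predicted": 13.6, "reference": 13.6},
]

print(accuracy_by_group(records, "tier"))
print(accuracy_by_group(records, "subfield"))
```

Grouping by one field at a time keeps the sketch short; a real harness would likely also need unit handling and answer parsing, which are omitted here.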
