Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning
Yuyang Wu, Yue Huang, Shuaike Shen, Xujian Wang, Shuhao Zhang, Qiyao Xue, Weichen Liu, Runtian Gao, Jian Ma, Xiangliang Zhang, Olexandr Isayev

TL;DR
This paper introduces ChemCost, a benchmark for evaluating large language models on chemical cost reasoning, highlighting the challenges and failure modes in tool use and grounding in chemistry tasks.
Contribution
It provides ChemCost, a benchmark of 1,427 evaluable reactions grounded to a frozen pricing snapshot (2,261 chemicals, 230,775 supplier quotes), with stage-level analysis of how well LLMs estimate chemical procurement cost.
Findings
Even the strongest agents reach only 50.6% accuracy within 25% error on clean inputs.
Performance degrades significantly with realistic noise and input perturbations.
Failures are mainly due to brittle parsing, poor evidence integration, and invalid pack selection.
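The "accuracy within 25% error" figure can be read as the fraction of reactions whose predicted cost lands within a relative tolerance of the ground-truth cost. The sketch below illustrates that reading only; the function name, the treatment of missing answers as misses, and the use of relative error are assumptions, not the benchmark's confirmed scoring code.

```python
from typing import Optional, Sequence


def within_tolerance_accuracy(
    predicted: Sequence[Optional[float]],
    actual: Sequence[float],
    tolerance: float = 0.25,
) -> float:
    """Fraction of predictions whose relative error is at most `tolerance`.

    Assumption: a missing prediction (None) counts as a miss, and the
    error is measured relative to the ground-truth cost.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one reaction is required")

    hits = 0
    for pred, truth in zip(predicted, actual):
        if pred is None:
            continue  # the agent produced no estimate for this reaction
        if abs(pred - truth) <= tolerance * abs(truth):
            hits += 1
    return hits / len(actual)


# Four hypothetical reactions: one exact, one 30% high, one missing, one 20% low.
score = within_tolerance_accuracy([100.0, 130.0, None, 80.0],
                                  [100.0, 100.0, 50.0, 100.0])
```

Under this reading, a single scalar per reaction is enough to score an agent without a human or LLM judge.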
Abstract
Large Language Models (LLMs) have become increasingly capable as tool-using agents, with benchmarks spanning diverse general agentic tasks. Yet rigorous evaluation of scientific tool use remains limited. In chemistry, recent agents can plan syntheses and invoke domain-specific tools, but evaluations often rely on curated demonstrations, expert assessment, or LLM-as-judge scoring rather than exact, judge-free ground truth. We address this gap with chemical procurement cost estimation, a practical task in which an agent must ground chemical identities, retrieve supplier quotes, select valid purchasable packs, normalize quantities, and compute cost from a reaction description. We introduce ChemCost, a benchmark of 1,427 evaluable reactions grounded to a frozen pricing snapshot covering 2,261 chemicals and 230,775 supplier quotes, supporting scalar scoring and stage-level diagnosis of…
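The pack-selection and quantity-normalization steps named in the abstract can be illustrated with a small sketch. Everything here is hypothetical: the `Quote` fields, the unit table, and the rule of buying whole multiples of a single pack size at the lowest total price are assumptions chosen for illustration, since the excerpt does not specify ChemCost's actual validity rules.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Tuple

# Hypothetical unit table; mass units only, for illustration.
UNIT_TO_GRAMS = {"mg": 1e-3, "g": 1.0, "kg": 1e3}


@dataclass(frozen=True)
class Quote:
    """One supplier offer for a chemical (illustrative schema)."""
    supplier: str
    pack_size: float
    unit: str
    price: float  # price per pack


def to_grams(amount: float, unit: str) -> float:
    """Normalize a mass to grams; raises KeyError for unknown units."""
    return amount * UNIT_TO_GRAMS[unit]


def cheapest_purchase(required_g: float,
                      quotes: Iterable[Quote]) -> Tuple[float, Quote, int]:
    """Return (total_cost, quote, n_packs) for the cheapest whole-pack purchase.

    Assumption: packs cannot be split, so the buyer needs
    ceil(required / pack_size) packs of the chosen quote.
    """
    best = None
    for quote in quotes:
        pack_g = to_grams(quote.pack_size, quote.unit)
        if pack_g <= 0:
            continue  # skip malformed quotes
        n_packs = math.ceil(required_g / pack_g)
        total = n_packs * quote.price
        if best is None or total < best[0]:
            best = (total, quote, n_packs)
    if best is None:
        raise ValueError("no valid quotes available")
    return best


quotes = [
    Quote("A", 5, "g", 20.0),
    Quote("B", 25, "g", 60.0),
    Quote("C", 1, "g", 6.0),
    Quote("D", 1, "kg", 500.0),
]
total, chosen, n_packs = cheapest_purchase(7.0, quotes)
```

For a 7 g requirement, the lowest per-gram price (supplier B) is not the cheapest purchase, because whole packs must be bought; this is the kind of pack-level reasoning the task demands beyond simple unit-price lookup.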
