Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations

Hao Li; He Cao; Bin Feng; Yanjun Shao; Xiangru Tang; Zhiyuan Yan; Li Yuan; Yonghong Tian; Yu Li

arXiv:2505.21318 · cs.AI · January 8, 2026

TL;DR

This paper introduces ChemCoTBench, a framework for evaluating large language models' chemical reasoning through modular, step-by-step operations that mimic mathematical proofs, aiming to improve AI's role in chemical discovery.

Contribution

It proposes a novel reasoning framework that formalizes chemical problem-solving into transparent workflows using modular operations, bridging the gap between abstract reasoning and practical chemical tasks.

Findings

01. Models show improved reasoning on molecular optimization tasks.

02. ChemCoTBench provides structured datasets and evaluation metrics.

03. Baseline evaluations demonstrate the framework's effectiveness.

Abstract

While large language models (LLMs) with Chain-of-Thought (CoT) reasoning excel in mathematics and coding, their potential for systematic reasoning in chemistry, a domain demanding rigorous structural analysis for real-world tasks like drug design and reaction engineering, remains untapped. Current benchmarks focus on simple knowledge retrieval, neglecting the step-by-step reasoning required for complex tasks such as molecular optimization and reaction prediction. To address this, we introduce ChemCoTBench, a reasoning framework that bridges molecular structure understanding with arithmetic-inspired operations, including addition, deletion, and substitution, to formalize chemical problem-solving into transparent, step-by-step workflows. By treating molecular transformations as modular "chemical operations", the framework enables slow-thinking reasoning, mirroring the logic of mathematical…
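The abstract names three modular operations (addition, deletion, substitution) that are applied step by step. The sketch below illustrates that idea only. It assumes a toy representation in which a molecule is a multiset of named functional groups. The names `Step`, `apply_step`, and `run_workflow` are hypothetical and are not taken from ChemCoTBench, which operates on real molecular structures rather than group counts.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Step:
    """One modular operation (hypothetical type, for illustration only)."""
    op: str                          # "add", "delete", or "substitute"
    group: str                       # functional group the step acts on
    new_group: Optional[str] = None  # replacement group, for "substitute"


def apply_step(groups: Counter, step: Step) -> Counter:
    """Return a new multiset of groups with one operation applied."""
    result = Counter(groups)  # copy, so the input is left untouched
    if step.op == "add":
        result[step.group] += 1
    elif step.op in ("delete", "substitute"):
        if result[step.group] == 0:
            raise ValueError(f"cannot {step.op} absent group: {step.group}")
        if step.op == "substitute" and step.new_group is None:
            raise ValueError("substitute needs new_group")
        result[step.group] -= 1
        if step.op == "substitute":
            result[step.new_group] += 1
    else:
        raise ValueError(f"unknown operation: {step.op}")
    return +result  # unary plus drops groups whose count fell to zero


def run_workflow(start: Counter, steps: List[Step]) -> Tuple[Counter, List[Counter]]:
    """Apply steps in order, keeping every intermediate state as a trace."""
    trace = [Counter(start)]
    for step in steps:
        trace.append(apply_step(trace[-1], step))
    return trace[-1], trace


# Toy example: the group names are illustrative, not data from the paper.
start = Counter({"hydroxyl": 1, "methyl": 2})
steps = [
    Step("add", "fluoro"),
    Step("delete", "methyl"),
    Step("substitute", "hydroxyl", "amino"),
]
final, trace = run_workflow(start, steps)
```

Keeping every intermediate state in `trace` is what makes the workflow inspectable: each step can be checked on its own instead of judging only the final answer.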
