MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs
Christoph Bartmann, Johannes Schimunek, Mykyta Ielanskyi, Philipp Seidl, Günter Klambauer, Sohvi Luukkonen

TL;DR
MolecularIQ is a new benchmark designed to evaluate the reasoning capabilities of models on molecular graphs through symbolically verifiable tasks, providing insights into their strengths and limitations in chemical reasoning.
Contribution
It introduces MolecularIQ, a benchmark focused on symbolic reasoning over molecular structures, enabling detailed evaluation of chemistry models' reasoning abilities.
Findings
Reveals specific model failure patterns on molecular graph tasks
Provides actionable insights for improving chemistry LLMs
Highlights the importance of structure-aware reasoning in chemistry models
Abstract
A molecule's properties are fundamentally determined by its composition and structure encoded in its molecular graph. Thus, reasoning about molecular properties requires the ability to parse and understand the molecular graph. Large Language Models (LLMs) are increasingly applied to chemistry, tackling tasks such as molecular name conversion, captioning, text-guided generation, and property or reaction prediction. Most existing benchmarks emphasize general chemical knowledge, rely on literature or surrogate labels that risk leakage or bias, or reduce evaluation to multiple-choice questions. We introduce MolecularIQ, a molecular structure reasoning benchmark focused exclusively on symbolically verifiable tasks. MolecularIQ enables fine-grained evaluation of reasoning over molecular graphs and reveals capability patterns that localize model failures to specific tasks and molecular…
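To make "symbolically verifiable" concrete, the following is a minimal sketch of how a structural constraint on a molecular graph can be checked exactly, without surrogate labels. It is not the benchmark's verifier; the function names (`count_rings`, `heteroatom_indices`, `verify_constraints`), the heavy-atom graph encoding, and the definition of ring count as the number of independent cycles are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Bond = Tuple[int, int]


def count_rings(num_atoms: int, bonds: Sequence[Bond]) -> int:
    """Number of independent cycles (bonds - atoms + connected components).

    Assumption: "ring count" means the cyclomatic number of the graph.
    """
    parent = list(range(num_atoms))

    def find(x: int) -> int:
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = num_atoms
    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1
    return len(bonds) - num_atoms + components


def heteroatom_indices(elements: Sequence[str]) -> List[int]:
    """Indices of atoms that are neither carbon nor hydrogen."""
    return [i for i, el in enumerate(elements) if el not in ("C", "H")]


def verify_constraints(elements: Sequence[str], bonds: Sequence[Bond],
                       want_rings: int, want_heteroatoms: int) -> bool:
    """Exact check of a 'k rings and m heteroatoms' style constraint."""
    return (count_rings(len(elements), bonds) == want_rings
            and len(heteroatom_indices(elements)) == want_heteroatoms)


# Hypothetical example: pyridine as a heavy-atom graph (hydrogens implicit).
PYRIDINE_ELEMENTS = ["N", "C", "C", "C", "C", "C"]
PYRIDINE_BONDS = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
```

Under these assumptions, `verify_constraints(PYRIDINE_ELEMENTS, PYRIDINE_BONDS, 1, 1)` returns `True`, and `heteroatom_indices` returns the atom positions as well as the count, which is the kind of index attribution that separates graph reasoning from a guessed number.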
Peer Reviews
Decision: ICLR 2026 Poster
* This paper provides a wide-ranging evaluation across diverse model types and sizes.
* The introduced MoleculeQID can be further utilized for new and open molecules, guaranteeing its scalability.
* This study is grounded in the belief that there is a positive correlation between molecular structural understanding and molecular reasoning ability for complex property prediction. However, this belief is not explicitly demonstrated, so the necessity of building a structure-reasoning benchmark appears limited. I suggest presenting the relationship between structural understanding and predictive performance on molecular properties.
* Overthinking is a well-known pitfall in molecular structure
1. The paper presents a verifiable benchmark and all tasks have ground-truth solutions, allowing reliable automatic evaluation.
2. The benchmark varies the molecular complexity and tests different chemical reasoning skills.
1. The benchmark excludes tasks like quantitative property prediction or reaction prediction, so it does not evaluate LLMs’ ability on some real-world chemistry problems.
2. The tasks in the benchmark are fundamental checks that chemical software can do straightforwardly. They may be somewhat disconnected from how humans typically solve chemistry problems. For example, asking a model to ‘generate a molecule with two rings and five heteroatoms’ is more like a puzzle or exercise than creative pr
Tight problem statement & clear contribution. A fully symbolically verifiable chemistry benchmark focused on structure-grounded reasoning (not factual recall) is timely and well-motivated. Three-axis profiling. Disentangling reasoning type, multitask load, and molecular complexity provides diagnostic granularity and actionable error localization. Index-based tasks. Pairing counting with index attribution helps distinguish genuine graph reasoning from pattern-matching/shortcut counts. Solid ev
2D-only scope. Restricting to graph connectivity omits 3D stereoelectronic/conformational effects that matter for realistic chemical reasoning; several stereochemistry tasks may still be fragile under a purely 2D treatment. Verifier dependence & edge cases. Heavy reliance on RDKit rules brings corner-case risk (e.g., aromaticity/kekulization, tautomers, undefined stereocenters). Clear auditing and unit tests for borderline cases would strengthen claims. Dataset scale & coverage. The main stati
Taxonomy
Topics: Machine Learning in Materials Science · Advanced Graph Neural Networks · Computational Drug Discovery Methods
