When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs

Bogdan Zagribelnyy; Ivan Ilin; Maksim Kuznetsov; Nikita Bondarev; Roman Schutski; Thomas MacDougall; Rim Shayakhmetov; Zulfat Miftakhutdinov; Mikolaj Mizera; Vladimir Aladinskiy; Alex Aliper; Alex Zhavoronkov

arXiv:2602.03554 · cs.LG · February 4, 2026

TL;DR

This paper introduces a benchmarking framework for single-step retrosynthesis that scores chemical plausibility rather than exact match against a single reference. It pairs the ChemCensor metric with the CREED dataset to evaluate and train LLMs for synthesis planning.

Contribution

The paper proposes ChemCensor, a chemical plausibility metric, and CREED, a dataset of millions of ChemCensor-validated reaction records, for evaluating and training LLMs on retrosynthesis.

Findings

01 ChemCensor correlates better with human judgment than traditional metrics.

02 Training with CREED improves LLM performance on the new benchmark.

03 The framework better captures the open-ended nature of real-world synthesis planning.

Abstract

Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective evaluation of retrosynthesis performance remains limited. Existing benchmarks and metrics typically rely on published synthetic procedures and Top-K accuracy based on a single ground truth, which does not capture the open-ended nature of real-world synthesis planning. We propose a new benchmarking framework for single-step retrosynthesis that evaluates both general-purpose and chemistry-specialized LLMs using ChemCensor, a novel metric for chemical plausibility. By emphasizing plausibility over exact match, this approach better aligns with human synthesis planning practices. We also introduce CREED, a novel dataset comprising millions of ChemCensor-validated reaction records for LLM training, and use it to train a model that improves over the LLM…
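To make the contrast in the abstract concrete, the following is a minimal sketch of Top-K exact-match accuracy next to a Top-K plausibility rate. It is not the paper's implementation: the function names, the toy SMILES strings, and the `is_plausible` callable are all assumptions introduced for illustration, and the set lookup below is only a stand-in for a real plausibility scorer such as ChemCensor, whose interface is not described here. The `normalize` helper only makes the comparison insensitive to reactant order; a real pipeline would canonicalize SMILES with a cheminformatics toolkit.

```python
from typing import Callable, Sequence


def normalize(reactants: str) -> str:
    """Order-insensitive key for a dot-separated reactant set.

    Assumption: strings are compared textually; no SMILES canonicalization.
    """
    parts = [p.strip() for p in reactants.split(".") if p.strip()]
    return ".".join(sorted(parts))


def top_k_exact_match(
    predictions: Sequence[Sequence[str]], references: Sequence[str], k: int
) -> float:
    """Fraction of targets whose single reference appears in the top k predictions."""
    hits = 0
    for ranked, ref in zip(predictions, references):
        target = normalize(ref)
        if any(normalize(p) == target for p in ranked[:k]):
            hits += 1
    return hits / len(references)


def top_k_plausible(
    predictions: Sequence[Sequence[str]],
    is_plausible: Callable[[str], bool],
    k: int,
) -> float:
    """Fraction of targets with at least one plausible prediction in the top k."""
    hits = sum(1 for ranked in predictions if any(is_plausible(p) for p in ranked[:k]))
    return hits / len(predictions)


# Toy data (hypothetical): ranked reactant proposals for two target products.
predictions = [
    ["CC(=O)Cl.CCO", "CC(=O)O.CCO"],  # two routes to the same ester
    ["CCBr.Oc1ccccc1"],               # alkyl bromide instead of the reference iodide
]
references = ["CCO.CC(=O)O", "CCI.Oc1ccccc1"]

# Hypothetical stand-in for a plausibility scorer: a fixed set of accepted proposals.
accepted = {normalize(s) for s in ["CC(=O)Cl.CCO", "CC(=O)O.CCO", "CCBr.Oc1ccccc1"]}


def toy_is_plausible(proposal: str) -> bool:
    return normalize(proposal) in accepted


exact_top1 = top_k_exact_match(predictions, references, k=1)
exact_top2 = top_k_exact_match(predictions, references, k=2)
plausible_top1 = top_k_plausible(predictions, toy_is_plausible, k=1)
```

On this toy data the top-ranked proposals miss both references, so exact match at k=1 penalizes them, while the plausibility rate counts them as acceptable. That gap is the kind of mismatch a single-ground-truth benchmark can produce.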

Taxonomy

Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Chemical Synthesis and Analysis