SynthStrategy: Extracting and Formalizing Latent Strategic Insights from LLMs in Organic Chemistry
Daniel Armstrong, Zlatko Jončev, Andres M Bran, Philippe Schwaller

TL;DR
This paper presents SynthStrategy, a method using Large Language Models to extract, formalize, and test strategic insights in organic synthesis planning, improving route retrieval and analysis.
Contribution
It introduces a novel approach that formalizes strategic synthesis principles as Python code, enabling interpretable, verifiable, and strategy-aware analysis of synthesis routes in CASP.
Findings
Achieved 75% Top-3 accuracy in route retrieval
Created a dataset with strategic annotations of synthesis routes
Enabled granular clustering and historical trend analysis
Abstract
Modern computer-assisted synthesis planning (CASP) systems show promise in generating chemically valid reaction steps but struggle to incorporate strategic considerations such as convergent assembly, protecting group minimization, and optimal ring-forming sequences. We introduce a methodology that leverages Large Language Models to distill synthetic knowledge into code. Our system analyzes synthesis routes and translates strategic principles into Python functions representing diverse strategic and tactical rules, such as strategic functional group interconversions and ring construction strategies. By formalizing this knowledge as verifiable code rather than simple heuristics, we create testable, interpretable representations of synthetic strategy. We release the complete codebase and the USPTO-ST dataset of synthesis routes annotated with strategic tags. This framework unlocks a novel…
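As an illustration of what "a strategic principle formalized as a Python function" could look like, here is a minimal sketch. It is not taken from the SynthStrategy codebase: the nested-dictionary route format, the helper names (`mol`, `rxn`, `iter_reactions`, `is_convergent`, `annotate`), and the tag names are all assumptions made for this example, and the molecule labels are placeholders rather than real structures.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical route format: a tree of alternating molecule and reaction nodes.
# A molecule node without "children" is treated as a starting material.
Node = Dict[str, object]


def mol(label: str, *reactions: Node) -> Node:
    """Build a molecule node; pass the reaction that produces it, if any."""
    node: Node = {"type": "mol", "label": label}
    if reactions:
        node["children"] = list(reactions)
    return node


def rxn(*reactants: Node) -> Node:
    """Build a reaction node whose children are its reactant molecules."""
    return {"type": "reaction", "children": list(reactants)}


def iter_reactions(node: Node) -> Iterator[Node]:
    """Yield every reaction node in the route tree, depth first."""
    if node["type"] == "reaction":
        yield node
    for child in node.get("children", []):
        yield from iter_reactions(child)


def is_convergent(route: Node) -> bool:
    """Assumed rule: a route is convergent if some reaction joins at least
    two reactants that were themselves made in earlier steps."""
    for reaction in iter_reactions(route):
        built = [m for m in reaction.get("children", []) if m.get("children")]
        if len(built) >= 2:
            return True
    return False


# Hypothetical tag registry: each strategic tag maps to a testable predicate.
STRATEGIES: Dict[str, Callable[[Node], bool]] = {
    "convergent_assembly": is_convergent,
    "linear_sequence": lambda route: not is_convergent(route),
}


def annotate(route: Node) -> List[str]:
    """Return the sorted list of strategic tags whose predicate holds."""
    return sorted(tag for tag, check in STRATEGIES.items() if check(route))


# Two toy routes to the same placeholder target "T".
linear_route = mol("T", rxn(mol("I1", rxn(mol("A"), mol("B"))), mol("C")))
convergent_route = mol(
    "T",
    rxn(mol("I1", rxn(mol("A"), mol("B"))), mol("I2", rxn(mol("C"), mol("D")))),
)

print(annotate(linear_route))
print(annotate(convergent_route))
```

Because each rule is an ordinary function over a route, it can be unit-tested on hand-built routes, and the resulting tags can serve as annotations for retrieval or clustering in the spirit of the abstract.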
Taxonomy
Topics: Machine Learning in Materials Science · Synthetic Organic Chemistry Methods · Computational Drug Discovery Methods
