CompleteRXN: Toward Completing Open Chemical Reaction Databases
Gabriel Vogel, Minouk Noordsij, Evgeny Pidko, Jana M. Weber

TL;DR
CompleteRXN is a large-scale supervised benchmark for reaction completion: models must restore the byproducts, co-reactants, and stoichiometric coefficients missing from reaction records in datasets such as USPTO.
Contribution
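The paper presents CompleteRXN, a benchmark dataset of aligned incomplete and atom-balanced reactions, and evaluates representative baselines on it: the Constrained Reaction Balancer (CRB), a novel encoder-decoder model with constrained decoding, and SynRBL, a recent algorithmic method. The evaluation highlights the limitations of current approaches.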
Findings
CRB achieves 99.20% equivalence accuracy on the random split
SynRBL produces plausible completions but reaches lower accuracy
Performance drops significantly on out-of-distribution reactions
Abstract
Chemical reaction datasets such as USPTO suffer from substantial incompleteness, frequently missing byproducts, co-reactants, and stoichiometric coefficients. This limits their applicability and reliability in downstream applications. Here, we introduce CompleteRXN, a large-scale supervised benchmark for reaction completion under realistic missing-data conditions. We construct a dataset of aligned incomplete and atom-balanced reactions by mapping USPTO records to curated mechanistic reactions. We evaluate representative baselines, including a novel encoder-decoder reaction completion model with constrained decoding, the Constrained Reaction Balancer (CRB), and a recent algorithmic method, SynRBL. On our CompleteRXN benchmark, the CRB achieves high performance across splits of increasing difficulty, reaching 99.20% equivalence accuracy on the random split and 91.12% on the extreme…
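To make "atom-balanced" concrete, here is a minimal sketch of an atom-balance check in Python. It is not the paper's pipeline: the helper names (`parse_formula`, `side_counts`, `atom_imbalance`) are hypothetical, and it assumes species are given as simple molecular formulas without parentheses or charges, rather than the reaction representations the benchmark itself uses. The example reaction (an esterification recorded without its water byproduct) is an illustration chosen here, not taken from the dataset.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Count atoms in a simple formula such as 'C2H4O2' (no parentheses, no charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def side_counts(side):
    """Sum atom counts over one reaction side given as (coefficient, formula) pairs."""
    total = Counter()
    for coefficient, formula in side:
        for element, n in parse_formula(formula).items():
            total[element] += coefficient * n
    return total

def atom_imbalance(reactants, products):
    """Return {element: reactant count minus product count} for unbalanced elements."""
    left, right = side_counts(reactants), side_counts(products)
    diff = {el: left[el] - right[el] for el in set(left) | set(right)}
    return {el: d for el, d in diff.items() if d != 0}

# Illustrative record: acetic acid + ethanol -> ethyl acetate, water omitted.
reactants = [(1, "C2H4O2"), (1, "C2H6O")]
incomplete = atom_imbalance(reactants, [(1, "C4H8O2")])
completed = atom_imbalance(reactants, [(1, "C4H8O2"), (1, "H2O")])

# Stoichiometric coefficients matter too: H2 + O2 -> H2O only balances as 2:1:2.
unit_coefficients = atom_imbalance([(1, "H2"), (1, "O2")], [(1, "H2O")])
with_coefficients = atom_imbalance([(2, "H2"), (1, "O2")], [(2, "H2O")])
```

In this sketch the incomplete record leaves two hydrogen atoms and one oxygen atom unaccounted for on the product side, which is exactly the missing water. An empty result means the reaction is atom-balanced. A completion method has the harder task of proposing which species and coefficients close that gap.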
