TL;DR
This paper introduces a novel machine learning method to automatically identify and remove chemically incorrect entries from reaction datasets, significantly improving the accuracy of predictive models without requiring chemical expertise.
Contribution
It presents the first unassisted, rule-free approach for noise reduction in chemical reaction data sets, enhancing model performance in synthetic chemistry tasks.
Findings
Retrosynthetic model accuracy increased by 13 percentage points.
Data cleaning reduced the Jensen-Shannon divergence by 30%.
Coverage remained high at 97%, with class diversity unchanged.
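For readers unfamiliar with the metric in the findings above, here is a minimal sketch of the standard Jensen-Shannon divergence between two discrete distributions. This summary does not say which distributions the paper compares or which variant of the divergence it reports, so the function names and the example class frequencies below are hypothetical and purely illustrative.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms where p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Standard Jensen-Shannon divergence between two discrete distributions.

    With base-2 logarithms the value lies in [0, 1]: 0 for identical
    distributions, 1 for distributions with disjoint support.
    """
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical per-class frequencies (NOT data from the paper).
reference = [0.50, 0.30, 0.20]
predicted = [0.40, 0.35, 0.25]
print(jensen_shannon_divergence(reference, predicted))
```

A lower value means the two distributions are closer, so a 30% reduction reads as a relative decrease in this kind of distance-like quantity.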
Abstract
Existing deep learning models applied to reaction prediction in organic chemistry can reach high levels of accuracy (> 90% for Natural Language Processing-based ones). With no chemical knowledge embedded other than the information learnt from reaction data, the quality of the data sets plays a crucial role in the performance of the prediction models. Because human curation is prohibitively expensive, unaided approaches to remove chemically incorrect entries from existing data sets are essential to improve artificial intelligence models' performance in synthetic chemistry tasks. Here we propose a machine learning-based, unassisted approach to remove chemically wrong entries from chemical reaction collections. We applied this method to the collection of chemical reactions Pistachio and to an open data set, both extracted from USPTO (United States Patent and Trademark Office) patents. Our results…
