Unassisted Noise Reduction of Chemical Reaction Data Sets

Alessandra Toniato; Philippe Schwaller; Antonio Cardinale; Joppe Geluykens; Teodoro Laino

arXiv:2102.01399·cs.LG·February 3, 2021

TL;DR

This paper introduces a novel machine learning method to automatically identify and remove chemically incorrect entries from reaction datasets, significantly improving the accuracy of predictive models without requiring chemical expertise.

Contribution

It presents the first unassisted, rule-free approach for noise reduction in chemical reaction data sets, enhancing model performance in synthetic chemistry tasks.

Findings

1. Retrosynthetic model accuracy increased by 13 percentage points.
2. Data cleaning reduced the Jensen-Shannon divergence by 30%.
3. Coverage remained high at 97%, with unchanged class diversity.
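
For readers unfamiliar with the metric in the second finding, the following is a minimal sketch of how a Jensen-Shannon divergence between two discrete distributions can be computed. The summary above does not say which distributions the paper compares, so the function names and the toy counts are illustrative assumptions, not the authors' evaluation code.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits for discrete distributions."""
    # Terms with p_i == 0 contribute nothing by convention.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits; symmetric and bounded in [0, 1]."""
    # Normalise raw counts so that each input sums to 1.
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # The mixture m is non-zero wherever p or q is non-zero,
    # so both KL terms below are always finite.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Hypothetical per-class counts, made up purely for illustration.
    reference_counts = [40, 30, 20, 10]
    predicted_counts = [55, 25, 15, 5]
    print(jensen_shannon_divergence(reference_counts, predicted_counts))
```

A value of 0 means the two distributions are identical, and 1 bit (with base-2 logarithms) means they do not overlap at all, so a lower value after cleaning indicates that the two distributions being compared have moved closer together.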

Abstract

Existing deep learning models applied to reaction prediction in organic chemistry can reach high levels of accuracy (> 90% for Natural Language Processing-based ones). Because these models embed no chemical knowledge other than the information learnt from reaction data, the quality of the data sets plays a crucial role in their performance. Human curation is prohibitively expensive, so unaided approaches to remove chemically incorrect entries from existing data sets are essential to improve artificial intelligence models' performance in synthetic chemistry tasks. Here we propose a machine learning-based, unassisted approach to remove chemically wrong entries from chemical reaction collections. We applied this method to the collection of chemical reactions Pistachio and to an open data set, both extracted from USPTO (United States Patent and Trademark Office) patents. Our results…

Code & Models

Repositories

rxn4chemistry/OpenNMT-py (PyTorch, official)
