Autonomous data extraction from peer reviewed literature for training machine learning models of oxidation potentials
Siwoo Lee, Stefan Heinen, Danish Khan, O. Anatole von Lilienfeld

TL;DR
This paper introduces an automated pipeline combining neural networks and language models to extract and analyze oxidation potential data from literature, enabling large-scale predictions and insights with minimal manual effort.
Contribution
The authors developed a novel automated data extraction pipeline that significantly reduces manual labor and enables large-scale prediction of oxidation potentials using machine learning.
Findings
Achieved prediction errors of ~0.2 V, comparable to experimental uncertainty.
Predicted oxidation potentials for ~132,000 small organic molecules from the QM9 data set.
Identified trends in the predictions: greater aliphaticity raises the oxidation potential, while a larger number of heavy atoms lowers it.
Abstract
We present an automated data-collection pipeline involving a convolutional neural network and a large language model to extract user-specified tabular data from peer-reviewed literature. The pipeline is applied to 74 reports published between 1957 and 2014 with experimentally-measured oxidation potentials for 592 organic molecules (-0.75 to 3.58 V). After data curation (solvents, reference electrodes, and missed data points), we trained multiple supervised machine learning models reaching prediction errors similar to experimental uncertainty (0.2 V). For experimental measurements of identical molecules reported in multiple studies, we identified the most likely value based on out-of-sample machine learning predictions. Using the trained machine learning models, we then estimated oxidation potentials of 132k small organic molecules from the QM9 data set, with predicted values…
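The abstract says that, when several studies report different values for the same molecule, the most likely one is identified from out-of-sample machine learning predictions, but it does not spell out the selection rule. The following is a minimal sketch under the assumption that the reported value closest to the out-of-sample prediction is kept. The function name `pick_most_likely`, the data layout, and all identifiers and numbers are hypothetical and for illustration only; they are not taken from the paper.

```python
from collections import defaultdict


def pick_most_likely(reports, predictions):
    """Choose one reported value per molecule.

    reports: iterable of (molecule_id, source, potential_in_volts) tuples.
    predictions: dict mapping molecule_id to an out-of-sample predicted
        potential in volts (e.g. from a model that never saw that molecule).
    Returns a dict mapping molecule_id to the (source, potential) pair whose
    potential is closest to the prediction. This closest-value rule is an
    assumption, not the paper's documented procedure.
    """
    by_molecule = defaultdict(list)
    for molecule_id, source, potential in reports:
        by_molecule[molecule_id].append((source, potential))

    chosen = {}
    for molecule_id, candidates in by_molecule.items():
        if molecule_id not in predictions:
            continue  # no prediction available, so no basis for a choice
        predicted = predictions[molecule_id]
        chosen[molecule_id] = min(
            candidates, key=lambda pair: abs(pair[1] - predicted)
        )
    return chosen


# Made-up illustration: two conflicting reports for mol_a, one for mol_b.
example_reports = [
    ("mol_a", "study_1", 1.76),
    ("mol_a", "study_2", 1.40),
    ("mol_b", "study_3", 2.25),
]
example_predictions = {"mol_a": 1.70, "mol_b": 2.00}
example_choice = pick_most_likely(example_reports, example_predictions)
```

In this made-up example the `study_1` value is kept for `mol_a` because it lies 0.06 V from the prediction, against 0.30 V for `study_2`. A molecule with a single report simply keeps that report.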
Taxonomy
Topics: Computational Drug Discovery Methods · Machine Learning in Materials Science · Metabolomics and Mass Spectrometry Studies
