Semi-automatic staging area for high-quality structured data extraction from scientific literature
Luca Foppiano, Tomoya Mato, Kensei Terashima, Pedro Ortiz Suarez, Taku Tou, Chikako Sakai, Wei-Sheng Wang, Toshiyuki Amagasa, Yoshihiko Takano, Masashi Ishii

TL;DR
This paper introduces SuperCon2, a semi-automatic system for extracting and curating high-quality data on superconductors from scientific literature, combining machine learning and manual curation to improve database accuracy.
Contribution
The paper presents a novel curation interface and workflow that integrate automatic anomaly detection and training data collection to enhance data extraction from PDFs.
Findings
Significant improvement in curation precision and recall.
Effective training data collection reduces manual effort.
Enhanced interface increases curation efficiency.
Abstract
We propose a semi-automatic staging area, called SuperCon2, for efficiently building an accurate database of experimental physical properties of superconductors from the literature, to enrich the existing manually built superconductor database SuperCon. Here we report our curation interface (SuperCon2 Interface) and a workflow that manages the state transitions of each examined record, used to validate the dataset of superconductors collected from PDF documents using Grobid-superconductors in a previous work. This curation workflow allows both automatic and manual operations. The automatic operations comprise "anomaly detection", which scans new data to identify outliers, and a "training data collector" mechanism, which collects training data examples based on manual corrections. This training data collection policy is effective in improving the machine-learning models with a reduced number of examples. For manual…
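The workflow outlined in the abstract (per-record state transitions, an automatic anomaly scan, and training examples collected from manual corrections) can be illustrated with a minimal Python sketch. This is not the SuperCon2 implementation: the status names, the `Record` fields, the `max_tc` plausibility threshold of 250 K, and the functions `detect_anomalies` and `correct` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Status(Enum):
    # Hypothetical record states; the real workflow may use different ones.
    NEW = "new"          # freshly extracted, not yet examined
    INVALID = "invalid"  # flagged by the automatic anomaly scan
    CURATED = "curated"  # manually corrected by a curator


@dataclass
class Record:
    material: str
    tc_kelvin: float
    status: Status = Status.NEW
    history: List[Status] = field(default_factory=list)

    def transition(self, new_status: Status) -> None:
        # Remember the previous state so every transition stays auditable.
        self.history.append(self.status)
        self.status = new_status


def detect_anomalies(records: List[Record], max_tc: float = 250.0) -> List[Record]:
    """Flag NEW records whose Tc lies outside an assumed plausible range."""
    flagged = []
    for record in records:
        if record.status is Status.NEW and not (0.0 < record.tc_kelvin <= max_tc):
            record.transition(Status.INVALID)
            flagged.append(record)
    return flagged


def correct(record: Record, new_tc: float, training_pool: List[Dict]) -> None:
    """Apply a manual correction and keep the before/after pair as a training example."""
    training_pool.append(
        {
            "material": record.material,
            "extracted_tc": record.tc_kelvin,
            "corrected_tc": new_tc,
        }
    )
    record.tc_kelvin = new_tc
    record.transition(Status.CURATED)
```

In this sketch a curator would call `detect_anomalies` on newly extracted records, review the flagged ones, and call `correct` on each; the accumulated `training_pool` stands in for the examples later used to retrain the extraction models.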
Taxonomy
Topics: Machine Learning in Materials Science · Topic Modeling · Scientific Computing and Data Management
