Datatractor: Metadata, automation, and registries for extractor interoperability in the chemical and materials sciences

Matthew L. Evans; Gian-Marco Rignanese; David Elbert; Peter Kraus

arXiv:2410.18839·physics.data-an·June 24, 2025

TL;DR

Datatractor enhances FAIR data practices in chemical and materials sciences by creating a registry of data extractors, standardizing their descriptions, and providing a reference implementation to improve discoverability and usability.

Contribution

The paper introduces a standardized schema and a curated registry for data extractors, facilitating interoperability and ease of use in scientific data workflows.

Findings

01 Increased discoverability of data extractor tools

02 Standardized, machine-actionable descriptions of extractors

03 Reference implementation for data extraction workflows

Abstract

Two key issues hinder the transition towards FAIR data science: data extractor tools are poorly discoverable, and the instructions for using them are inconsistent. These tools determine how we go from raw data files created by instruments to accessible metadata and scientific insight. If existing format conversion tools are hard to find, install, and use, they will be reimplemented, duplicating effort and increasing the associated maintenance burden. The Datatractor framework presented in this work addresses these issues. First, a curated registry of such extractor tools increases their discoverability. Second, describing them with a standardised but lightweight schema makes their installation and use machine-actionable. Finally, we provide a reference implementation for such data extraction. The Datatractor framework can be used to provide a…
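As a rough illustration of what a machine-actionable extractor description makes possible, the Python sketch below models a registry entry and looks up an extractor by file type. Every name in it (the `ExtractorEntry` fields, the `example-xrd-reader` tool, and its command line) is hypothetical and chosen for illustration only; it does not reproduce the actual Datatractor schema, registry contents, or reference implementation.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class ExtractorEntry:
    """Hypothetical registry record; field names are illustrative assumptions."""

    extractor_id: str
    supported_filetypes: tuple[str, ...]
    install_command: str
    # Usage template with two placeholders: $filetype and $input_path.
    usage_template: str


# Made-up registry contents; no real tool is described here.
REGISTRY = [
    ExtractorEntry(
        extractor_id="example-xrd-reader",
        supported_filetypes=("example-xrd",),
        install_command="pip install example-xrd-reader",
        usage_template="example-xrd-reader --type $filetype $input_path",
    ),
]


def find_extractors(filetype, registry=REGISTRY):
    """Return every registry entry that declares support for the file type."""
    return [entry for entry in registry if filetype in entry.supported_filetypes]


def render_usage(entry, input_path, filetype):
    """Fill the entry's usage template to obtain a concrete command string."""
    if filetype not in entry.supported_filetypes:
        raise ValueError(f"{entry.extractor_id} does not support {filetype!r}")
    return Template(entry.usage_template).substitute(
        input_path=input_path, filetype=filetype
    )
```

Because installation and usage are stored as data rather than as free-text instructions, a client could in principle select an extractor for a given file type and build the invocation without human intervention. The sketch only renders the command as a string and does not install or execute anything.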
