Extracting, Transforming and Archiving Scientific Data

Daniel Lemire; Andre Vellino

arXiv:1108.4041 · cs.DL · August 24, 2011 · 2 cites

Open Access

TL;DR

This paper introduces the ETA model to automate the curation of large, heterogeneous research datasets, addressing challenges in extraction, transformation, and long-term archiving to improve digital research data management.

Contribution

The paper presents the ETA model, a scalable framework for automating research data curation, including novel strategies for extracting and archiving diverse legacy data.

Findings

01 The ETA model is proposed for managing and mechanizing research data curation tasks.

02 A scalable strategy is proposed for the long-term storage of research data.

03 Review of existing solutions and new research directions.

Abstract

It is becoming common to archive research datasets that are not only large but also numerous. In addition, their corresponding metadata and the software required to analyse or display them need to be archived. Yet the manual curation of research data can be difficult and expensive, particularly in very large digital repositories, hence the importance of models and tools for automating digital curation tasks. The automation of these tasks faces three major challenges: (1) research data and data sources are highly heterogeneous, (2) future research needs are difficult to anticipate, (3) data is hard to index. To address these problems, we propose the Extract, Transform and Archive (ETA) model for managing and mechanizing the curation of research data. Specifically, we propose a scalable strategy for addressing the research-data problem, ranging from the extraction of legacy data to its…

Taxonomy

Topics: Scientific Computing and Data Management · Research Data Management Practices · Data Quality and Management