Position Paper on Dataset Engineering to Accelerate Science
Emilio Vital Brazil, Eduardo Soares, Lucas Villa Real, Leonardo Azevedo, Vinicius Segura, Luiz Zerkowski, and Renato Cerqueira

TL;DR
This paper advocates for treating datasets as first-class entities in scientific workflows, emphasizing their lifecycle management to accelerate discovery, especially when using AI methods, illustrated through material discovery as a case study.
Contribution
It introduces a conceptual framework for integrating datasets as central elements in scientific discovery processes, highlighting their lifecycle and management needs.
Findings
Datasets are crucial for structured scientific discovery.
Effective dataset management accelerates AI-driven research.
Material discovery exemplifies the importance of dataset lifecycle.
Abstract
Data is a critical element in any discovery process. In recent decades, we have observed exponential growth in the volume of available data and in the technology to manipulate it. However, data is only practical when one can structure it for a well-defined task. For instance, we need a corpus of text broken into sentences to train a natural language machine-learning model. In this work, we use the term "dataset" to designate a structured set of data built to perform a well-defined task. Moreover, the dataset will be used in most cases as a blueprint of an entity that at any moment can be stored as a table. Specifically, in science, each area has unique ways to organize, gather, and handle its datasets. We believe that datasets must be a first-class entity in any knowledge-intensive process, and all workflows should pay exceptional attention to datasets' lifecycle, from their…
Taxonomy
Topics: Scientific Computing and Data Management · Genetics, Bioinformatics, and Biomedical Research · Research Data Management Practices
