Leveraging Machine Learning to Detect Data Curation Activities
Sara Lafia, Andrea Thomer, David Bleckley, Dharma Akmon, Libby Hemphill

TL;DR
This paper develops a machine learning method to classify and analyze data curation activities from work logs at ICPSR, aiming to understand how curation impacts data reuse.
Contribution
It introduces a schema for data curation activities, a computational model for identifying these activities in logs, and analyzes their frequency over time.
Findings
A schema of data curation actions was created.
A text classifier was trained to detect curation activities.
Analysis revealed patterns in curation activities over time.
Abstract
This paper describes a machine learning approach for annotating and analyzing data curation work logs at ICPSR, a large social sciences data archive. The systems we studied track curation work and coordinate team decision-making at ICPSR. Repository staff use these systems to organize, prioritize, and document curation work done on datasets, making them promising resources for studying curation work and its impact on data reuse, especially in combination with data usage analytics. A key challenge, however, is classifying similar activities so that they can be measured and associated with impact metrics. This paper contributes: 1) a schema of data curation activities; 2) a computational model for identifying curation actions in work log descriptions; and 3) an analysis of frequent data curation activities at ICPSR over time. We first propose a schema of data curation actions to help us…
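The classification step mentioned in the abstract, assigning a curation action to each work log description, can be illustrated with a small sketch. This is not the authors' model: it swaps in a minimal bag-of-words Naive Bayes classifier written with the Python standard library only. The label names (`quality_check`, `transformation`, `documentation`), the example log entries, and the years are all hypothetical, invented for illustration, and do not reflect ICPSR's actual schema or data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real logs would need more cleaning."""
    return text.lower().split()


class NaiveBayesLogClassifier:
    """Multinomial Naive Bayes with add-alpha smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter()            # training entries per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def fit(self, entries, labels):
        for text, label in zip(entries, labels):
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + self.alpha * len(self.vocab)
            for tok in tokenize(text):
                if tok in self.vocab:  # tokens never seen in training are skipped
                    score += math.log((self.word_counts[label][tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled work log descriptions (not real ICPSR data).
training = [
    ("check variable labels for typos", "quality_check"),
    ("review missing value codes", "quality_check"),
    ("convert data file to spss format", "transformation"),
    ("recode variables and convert file format", "transformation"),
    ("write study description for metadata record", "documentation"),
    ("update codebook and metadata", "documentation"),
]
clf = NaiveBayesLogClassifier().fit(
    [text for text, _ in training], [label for _, label in training]
)

# Hypothetical unlabeled entries, each tagged with the year it was logged.
new_entries = [
    (2019, "convert file to stata format"),
    (2019, "update the codebook"),
    (2020, "convert data file format"),
]

# Count predicted activities per year, mirroring the over-time analysis.
activity_by_year = Counter((year, clf.predict(text)) for year, text in new_entries)
```

Once every log entry has a predicted label, tallying `(year, label)` pairs is enough to track how often each activity occurs over time, which is the kind of frequency analysis the paper lists as its third contribution.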
