Accessible Data Curation and Analytics for International-Scale Citizen Science Datasets
Benjamin Murray, Eric Kerfoot, Mark S. Graham, Carole H. Sudre, Erika Molteni, Liane S. Canas, Michela Antonelli, Kerstin Klaser, Alessia Visconti, Andrew T. Chan, Paul W. Franks, Richard Davies, Jonathan Wolf, Tim Spector, Claire J. Steves, Marc Modat, Sebastien Ourselin

TL;DR
This paper introduces ExeTera, open source software for scalable, reproducible data curation and analysis of large citizen science datasets such as the Covid Symptom Study, which has over 4.7 million participants.
Contribution
The paper presents ExeTera, a novel software tool that addresses scalability and reproducibility challenges in managing large-scale citizen science datasets.
Findings
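ExeTera targets datasets at the scale of the Covid Symptom Study (over 4.7 million participants and 189 million unique assessments), which standard software on commodity hardware can no longer process easily.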
It improves reproducibility of analytics across multiple research publications.
The software is open source and adaptable for international citizen science projects.
Abstract
The Covid Symptom Study, a smartphone-based surveillance study on COVID-19 symptoms in the population, is an exemplar of big data citizen science. Over 4.7 million participants and 189 million unique assessments have been logged since its introduction in March 2020. The success of the Covid Symptom Study creates technical challenges around effective data curation for two reasons. Firstly, the scale of the dataset means that it can no longer be easily processed using standard software on commodity hardware. Secondly, the size of the research group means that replicability and consistency of key analytics used across multiple publications become an issue. We present ExeTera, open source data curation software designed to address scalability challenges and to enable reproducible research across an international research group for datasets such as the Covid Symptom Study dataset.
