Unsupervised Machine Learning for Scientific Discovery: Workflow and Best Practices
Andersen Chang, Tiffany M. Tang, Tarek M. Zikry, Genevera I. Allen

TL;DR
This paper proposes a standardized workflow for applying unsupervised machine learning in scientific research, emphasizing best practices for data handling, validation, and reproducibility to enhance discovery across various scientific domains.
Contribution
The paper introduces a comprehensive, structured workflow for unsupervised learning in science that covers validation and reproducibility, and demonstrates it through an astronomy case study.
Findings
Rigorous validation, including checks on the stability and generalizability of conclusions, is essential in unsupervised learning
A structured workflow supports more reliable and reproducible scientific discovery
An astronomy case study illustrates the workflow's effectiveness
Abstract
Unsupervised machine learning is widely used to mine large, unlabeled datasets to make data-driven discoveries in critical domains such as climate science, biomedicine, astronomy, chemistry, and more. However, despite its widespread utilization, there is a lack of standardization in unsupervised learning workflows for making reliable and reproducible scientific discoveries. In this paper, we present a structured workflow for using unsupervised learning techniques in science. We highlight and discuss best practices starting with formulating validatable scientific questions, conducting robust data preparation and exploration, using a range of modeling techniques, performing rigorous validation by evaluating the stability and generalizability of unsupervised learning conclusions, and promoting effective communication and documentation of results to ensure reproducible scientific…
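The stability evaluation mentioned in the abstract can be illustrated with a generic subsampling check. The sketch below is not the authors' procedure: it assumes k-means as the clustering method, the adjusted Rand index as the agreement measure, and synthetic two-blob data, and every function name in it (`make_two_blobs`, `kmeans`, `adjusted_rand_index`, `subsample_stability`) is hypothetical. It clusters pairs of random subsamples and measures how well the two clusterings agree on the points they share.

```python
import random
from collections import Counter
from math import comb


def make_two_blobs(n_per_blob=50, seed=0):
    """Toy data (an assumption): two well-separated 2-D Gaussian blobs."""
    rng = random.Random(seed)
    blob_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_per_blob)]
    blob_b = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(n_per_blob)]
    return blob_a + blob_b


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd's k-means with farthest-first initialization."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the old center if a cluster empties
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same items."""
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items to compare labelings")
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (pairs_both - expected) / (max_index - expected)


def subsample_stability(points, k, n_rounds=20, frac=0.8, seed=0):
    """Mean agreement between clusterings of paired random subsamples."""
    rng = random.Random(seed)
    n = len(points)
    m = int(frac * n)
    scores = []
    for _ in range(n_rounds):
        idx_a = rng.sample(range(n), m)
        idx_b = rng.sample(range(n), m)
        labels_a = dict(zip(idx_a, kmeans([points[i] for i in idx_a], k, rng)))
        labels_b = dict(zip(idx_b, kmeans([points[i] for i in idx_b], k, rng)))
        shared = sorted(set(idx_a) & set(idx_b))
        scores.append(adjusted_rand_index([labels_a[i] for i in shared],
                                          [labels_b[i] for i in shared]))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    data = make_two_blobs()
    for k in (2, 3, 4):
        print(k, round(subsample_stability(data, k), 3))
```

A score near 1 means the cluster assignments barely change when the data are perturbed, while a markedly lower score for some choice of `k` suggests that the structure found at that setting may not be reliable. In practice the same loop could wrap any clustering method and any agreement measure.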
Taxonomy
Topics: Machine Learning in Materials Science · Scientific Computing and Data Management · Research Data Management Practices
