An ELIXIR scoping review on domain-specific evaluation metrics for synthetic data in life sciences
Styliani-Christina Fragkouli, Somya Iqbal, Lisa Crossman, Barbara Gravel, Nagat Masued, Mark Onders, Devesh Haseja, Alex Stikkelman, Alfonso Valencia, Tom Lenaerts, Fotis Psomopoulos, Pilib Ó Broin, Núria Queralt-Rosinach, Davide Cirillo

TL;DR
This paper reviews how synthetic data is evaluated in life sciences and highlights the need for standardized metrics to ensure reliability and trustworthiness.
Contribution
The paper provides a systematic review of evaluation metrics for synthetic data across six life science domains, revealing gaps in current practices.
Findings
Synthetic data generation methods are evolving rapidly, but systematic evaluation is often neglected.
Current evaluation practices limit the ability to compare and trust synthetic datasets across domains.
There is a clear need for robust, standardized metrics to guide the responsible use of synthetic data.
Abstract
Synthetic data (SD) has become an increasingly important asset in the life sciences, helping address data scarcity, privacy concerns, and barriers to data access. Creating artificial datasets that mirror the characteristics of real data allows researchers to develop and validate computational methods in controlled environments. Despite its promise, the adoption of SD in the life sciences hinges on rigorous evaluation metrics designed to assess its fidelity and reliability. To explore the current landscape of SD evaluation metrics in distinct life sciences domains, the ELIXIR Machine Learning Focus Group performed a systematic review of the scientific literature following the PRISMA guidelines. Six critical domains were examined to identify current practices for assessing SD. Findings reveal that, while generation methods are rapidly evolving, systematic evaluation is often overlooked,…
Taxonomy
Topics: Research Data Management Practices · Biomedical Text Mining and Ontologies
