Meta simulation approach for evaluating machine learning method selection in data limited settings

Mostafa Alwash; Ghadi S. Al Hajj; Ivar Grytten; Geir Kjetil Sandve

PMC · DOI:10.1038/s41598-025-24627-y · November 19, 2025

Open Access

TL;DR

This paper introduces a simulation framework to better evaluate machine learning methods in medical settings with limited data.

Contribution

The novel contribution is a meta-simulation framework called SimCalibration that improves ML benchmarking in data-scarce domains.

Findings

01. Structural learners vary in their ability to generate useful simulations for benchmarking.

02. Simulation-based benchmarking reduces variance in performance estimates compared to traditional validation.

03. In some cases, simulation-based rankings reflect true ML performance better than rankings derived from limited data.

Abstract

Selecting appropriate machine learning (ML) methods for domain-specific tasks remains a persistent challenge, particularly in medicine where datasets are often small, heterogeneous, and incomplete. Traditional benchmarking strategies rely on limited observational samples, which may not capture the complexity of the underlying data-generating process (DGP). As a result, methods that perform well on available data may generalise poorly in real-world practice. We present SimCalibration, a meta-simulation framework that leverages structural learners (SLs) to infer an approximated data-generating process from limited data and generate synthetic datasets for large-scale benchmarking. This framework enables systematic evaluation of machine learning method selection strategies in settings where the true data-generating process is either known or can be approximated, allowing both validation…
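
The workflow the abstract describes (infer an approximate DGP from a small sample, simulate many datasets from it, and benchmark candidate methods on those simulations) can be illustrated with a minimal sketch. This is not SimCalibration's code or API. The linear `true_dgp`, the joint-Gaussian fit used as a stand-in for a structural learner, the two toy methods, and all function names are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_dgp(n, rng):
    """Hypothetical ground-truth DGP (assumed for illustration): linear, Gaussian noise."""
    x1 = rng.normal(size=n)
    x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)
    y = 1.5 * x2 - 0.5 * x1 + rng.normal(scale=1.0, size=n)
    return np.column_stack([x1, x2]), y

def fit_approx_dgp(X, y):
    """Stand-in for a structural learner: fit a joint Gaussian over (X, y)."""
    data = np.column_stack([X, y])
    return data.mean(axis=0), np.cov(data, rowvar=False)

def sample_approx_dgp(model, n, rng):
    """Draw a synthetic dataset from the fitted approximate DGP."""
    mean, cov = model
    data = rng.multivariate_normal(mean, cov, size=n)
    return data[:, :-1], data[:, -1]

def ols_fit_predict(X_tr, y_tr, X_te):
    """Ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([np.ones(len(X_te)), X_te]) @ coef

def mean_fit_predict(X_tr, y_tr, X_te):
    """Baseline that always predicts the training mean."""
    return np.full(len(X_te), y_tr.mean())

METHODS = {"ols": ols_fit_predict, "mean": mean_fit_predict}

def benchmark(sampler, n_train, n_test, n_rep, rng):
    """Average test MSE per method over n_rep datasets drawn from sampler."""
    scores = {name: [] for name in METHODS}
    for _ in range(n_rep):
        X_tr, y_tr = sampler(n_train, rng)
        X_te, y_te = sampler(n_test, rng)
        for name, fit_predict in METHODS.items():
            pred = fit_predict(X_tr, y_tr, X_te)
            scores[name].append(np.mean((pred - y_te) ** 2))
    return {name: float(np.mean(v)) for name, v in scores.items()}

# Limited observational sample, as in a data-scarce setting.
X_small, y_small = true_dgp(40, rng)
model = fit_approx_dgp(X_small, y_small)

# Benchmark on simulations from the approximate DGP and, for reference,
# on fresh draws from the true DGP (possible here only because it is known).
sim_scores = benchmark(lambda n, r: sample_approx_dgp(model, n, r), 40, 200, 50, rng)
true_scores = benchmark(true_dgp, 40, 200, 50, rng)

sim_ranking = sorted(sim_scores, key=sim_scores.get)
true_ranking = sorted(true_scores, key=true_scores.get)
print("simulation-based ranking:", sim_ranking)
print("ranking under the true DGP:", true_ranking)
```

Comparing `sim_ranking` with `true_ranking` mirrors the kind of check a meta-simulation makes possible: when the true DGP is known, one can ask whether rankings obtained from the approximate DGP agree with rankings under the truth. A real structural learner would estimate a graph and conditional distributions, not a single joint Gaussian.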

Linked entities

Species (1)

Homo sapiens (human · species)

Diseases (1)

rare disease

Figures (14)

Taxonomy

Topics: Machine Learning and Data Classification · Machine Learning in Healthcare · Domain Adaptation and Few-Shot Learning