Toward Generalizable Machine Learning Models in Speech, Language, and Hearing Sciences: Estimating Sample Size and Reducing Overfitting
Hamzeh Ghasemzadeh, Robert E. Hillman, Daryush D. Mehta

TL;DR
This paper demonstrates that nested cross-validation provides more reliable estimates of model performance and sample size requirements in speech, language, and hearing sciences, reducing overfitting and improving generalizability.
Contribution
It introduces quantitative methods and MATLAB tools for power analysis and sample size estimation using nested cross-validation in ML studies.
Findings
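Nested 10-fold cross-validation yields the highest statistical power and confidence of the four methods compared.
The single holdout method has low statistical power and confidence and significantly overestimates accuracy.
The sample size required with the single holdout method can be 50% higher than with nested k-fold cross-validation.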
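The contrast between an optimistic estimate and a nested estimate can be sketched with a small simulation. The following is a minimal Python illustration, not the authors' MATLAB code: the one-feature threshold classifier, the helper names, and the settings (40 samples, 20 pure-noise features, 5 folds, 30 repetitions) are all assumptions chosen for brevity. Because the features carry no information, an unbiased estimate should sit near chance (0.5).

```python
import random

def avg(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def make_folds(rows, k, rng):
    """Shuffle row indices and split them into k nearly equal folds."""
    rows = rows[:]
    rng.shuffle(rows)
    return [rows[i::k] for i in range(k)]

def fit(X, y, rows, j):
    """One-feature classifier: threshold at the midpoint of the two class means."""
    m0 = avg(X[r][j] for r in rows if y[r] == 0)
    m1 = avg(X[r][j] for r in rows if y[r] == 1)
    return j, (m0 + m1) / 2, m1 >= m0

def score(model, X, y, rows):
    """Fraction of rows classified correctly by a fitted model."""
    j, thr, one_is_high = model
    return avg(((X[r][j] > thr) == one_is_high) == (y[r] == 1) for r in rows)

def cv_score(X, y, rows, j, k, rng):
    """Plain k-fold accuracy of feature j, using only the given rows."""
    folds = make_folds(rows, k, rng)
    scores = []
    for i, val in enumerate(folds):
        train = [r for f in folds[:i] + folds[i + 1:] for r in f]
        scores.append(score(fit(X, y, train, j), X, y, val))
    return avg(scores)

def non_nested(X, y, k, rng):
    """Optimistic: report the best k-fold score found while selecting a feature."""
    rows = list(range(len(y)))
    return max(cv_score(X, y, rows, j, k, rng) for j in range(len(X[0])))

def nested(X, y, k, rng):
    """Nested: select the feature in an inner loop, score on untouched outer folds."""
    outer = make_folds(list(range(len(y))), k, rng)
    scores = []
    for i, test in enumerate(outer):
        train = [r for f in outer[:i] + outer[i + 1:] for r in f]
        best = max(range(len(X[0])),
                   key=lambda j: cv_score(X, y, train, j, k, rng))
        scores.append(score(fit(X, y, train, best), X, y, test))
    return avg(scores)

rng = random.Random(0)
n, d, k, reps = 40, 20, 5, 30  # illustrative settings, not from the paper
opt, unb = [], []
for _ in range(reps):
    X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    y = [i % 2 for i in range(n)]  # balanced labels, unrelated to X
    opt.append(non_nested(X, y, k, rng))
    unb.append(nested(X, y, k, rng))
optimistic, unbiased = avg(opt), avg(unb)
print(f"non-nested: {optimistic:.3f}  nested: {unbiased:.3f}")
```

The only difference between the two estimates is whether the data used to choose the feature are also used to report accuracy. In the nested version the outer test fold plays no part in selection.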
Abstract
This study's first purpose is to provide quantitative evidence that would incentivize researchers to use the more robust method of nested cross-validation. The second purpose is to present methods and MATLAB code for performing power analysis for ML-based analyses during the design of a study. Monte Carlo simulations were used to quantify the interactions between the employed cross-validation method, the discriminative power of features, the dimensionality of the feature space, and the dimensionality of the model. Four cross-validation methods (single holdout, 10-fold, train-validation-test, and nested 10-fold) were compared based on the statistical power and statistical confidence of the ML models. Distributions of the null and alternative hypotheses were used to determine the minimum required sample size for obtaining a statistically significant outcome (α = 0.05,…
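The idea of reading a minimum sample size off the null and alternative distributions can be sketched in an idealized setting. The Python example below is a simplification, not the paper's Monte Carlo procedure or its MATLAB code. It assumes a two-class problem with chance accuracy 0.5, treats each test prediction as an independent Bernoulli trial, and uses a hypothetical power target of 0.8. The function names and the `true_acc` parameter are illustrative.

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def critical_count(n, chance=0.5, alpha=0.05):
    """Smallest number of correct predictions that rejects the null at level alpha."""
    for k in range(n + 1):
        if binom_sf(k, n, chance) <= alpha:
            return k
    return n + 1  # no achievable count rejects the null

def power(n, true_acc, chance=0.5, alpha=0.05):
    """Probability of rejecting the null when the classifier's true accuracy is true_acc."""
    return binom_sf(critical_count(n, chance, alpha), n, true_acc)

def min_sample_size(true_acc, target_power=0.8, chance=0.5, alpha=0.05, n_max=500):
    """First n whose power reaches the target (target_power is an assumed value)."""
    for n in range(2, n_max + 1):
        if power(n, true_acc, chance, alpha) >= target_power:
            return n
    return None

for acc in (0.65, 0.7, 0.8):
    print(acc, min_sample_size(acc))
```

Because the binomial distribution is discrete, power does not rise smoothly with n, so the function returns the first sample size that reaches the target rather than a guaranteed threshold for all larger n. Weaker features (true accuracy closer to chance) push the required sample size up.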
Taxonomy
Topics: Neural Networks and Applications
