A Comparative Study of Model Selection Criteria for Symbolic Regression
Ali Soltani, Gabriel Kronberger, Fabricio Olivetti de Franca, Mattia Billa, Alessandro Lucantonio

TL;DR
This paper systematically compares model selection criteria for symbolic regression, finding that MDL and BIC are the most effective at identifying models with low test error and at recovering the true underlying functions.
Contribution
The paper provides a comprehensive empirical evaluation of popular selection criteria, including AIC, BIC, and MDL, across multiple datasets in symbolic regression.
Findings
MDL consistently finds the models with the lowest test error and the shortest length.
BIC and MDL have the highest probability of selecting the ground-truth expression.
No single criterion outperforms others across all datasets.
Abstract
Effective model selection is critical in symbolic regression (SR) to identify mathematical expressions that balance accuracy and complexity, and have low expected error on unseen data. Many modern implementations of genetic programming (GP) for SR generate a set of Pareto optimal candidate solutions, but reliable automatic selection of solutions that generalize well remains an open issue. Current literature offers various information-theoretic and Bayesian approaches, yet comprehensive comparisons of their performance across different data regimes are limited. This study presents a systematic empirical comparison of widely used selection criteria: the Akaike information criterion (AIC), the corrected AIC (AICc), the Bayesian information criterion (BIC), minimum description length (MDL), as well as Efron's bootstrap estimate for the in-sample prediction error on seven synthetic datasets…
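To make the criteria named in the abstract concrete, here is a minimal sketch of AIC, AICc, and BIC in their textbook forms, applied to a small set of candidate expressions. It is not the paper's implementation. It assumes i.i.d. Gaussian noise with the variance set to its maximum-likelihood estimate, and it assumes k counts every fitted parameter of a candidate. The function names, the candidate list, and its numbers are hypothetical and serve only as illustration. MDL and Efron's bootstrap estimate are left out because their exact form depends on choices the abstract does not specify.

```python
import math


def gaussian_neg2_loglik(rss, n):
    """-2 * log-likelihood of a Gaussian model with ML variance rss / n."""
    sigma2 = rss / n
    return n * (math.log(2 * math.pi * sigma2) + 1)


def aic(rss, n, k):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * k + gaussian_neg2_loglik(rss, n)


def aicc(rss, n, k):
    """Small-sample corrected AIC; requires n > k + 1."""
    return aic(rss, n, k) + 2 * k * (k + 1) / (n - k - 1)


def bic(rss, n, k):
    """Bayesian information criterion: k log n - 2 log L."""
    return k * math.log(n) + gaussian_neg2_loglik(rss, n)


def select(candidates, n, criterion):
    """Return the candidate that minimizes the given criterion."""
    return min(candidates, key=lambda c: criterion(c["rss"], n, c["k"]))


# Hypothetical Pareto front: residual sum of squares (rss) on n = 50
# training points, and the assumed number of fitted parameters (k).
n_samples = 50
pareto_front = [
    {"expr": "a*x", "rss": 40.0, "k": 2},
    {"expr": "a*x + b*x**2", "rss": 10.0, "k": 3},
    {"expr": "a*x + b*x**2 + c*sin(d*x) + ...", "rss": 9.8, "k": 8},
]

for name, criterion in [("AIC", aic), ("AICc", aicc), ("BIC", bic)]:
    best = select(pareto_front, n_samples, criterion)
    print(f"{name}: {best['expr']}")
```

On this toy front all three criteria pick the middle expression. The longest candidate lowers the residual only slightly, which does not offset its larger parameter penalty. The criteria differ only in that penalty term, so they can disagree on other fronts, especially when n is small relative to k.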
