Can Synthetic Data Improve Symbolic Regression Extrapolation Performance?

Fitria Wulandari Ramlan; Colm O'Riordan; Gabriel Kronberger; James McDermott

arXiv:2511.22794·cs.LG·December 1, 2025

Open Access

TL;DR

This paper explores whether synthetic data generated via kernel density estimation (KDE) and knowledge distillation can improve the extrapolation of symbolic regression models, especially those evolved with genetic programming (GP), across six benchmark datasets.

Contribution

The paper introduces a method that combines KDE with teacher-student training to generate synthetic data, with the aim of improving the extrapolation performance of GP-based symbolic regression.

Findings

01. GP models benefit from synthetic data in extrapolation regions.

02. Synthetic data improves GP performance more than other models.

03. Extrapolation improvements depend on dataset and teacher model used.

Abstract

Many machine learning models perform well when making predictions within the training data range, but often struggle when required to extrapolate beyond it. Symbolic regression (SR) using genetic programming (GP) can generate flexible models but is prone to unreliable behaviour in extrapolation. This paper investigates whether adding synthetic data can help improve performance in such cases. We apply Kernel Density Estimation (KDE) to identify regions in the input space where the training data is sparse. Synthetic data is then generated in those regions using a knowledge distillation approach: a teacher model generates predictions on new input points, which are then used to train a student model. We evaluate this method across six benchmark datasets, using neural networks (NN), random forests (RF), and GP both as teacher models (to generate synthetic data) and as student models (trained…
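
The pipeline the abstract describes (estimate density, pick sparse regions, label them with a teacher, retrain a student on the augmented set) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the 1-D toy data, the bandwidth, the "lowest-density candidates" selection rule, and the polynomial fits standing in for the teacher and student are all assumptions made here. The paper itself uses NN, RF, and GP models in those roles.

```python
import numpy as np


def gaussian_kde_1d(train_x, query_x, bandwidth):
    """Evaluate a 1-D Gaussian kernel density estimate at each query point."""
    diffs = (query_x[:, None] - train_x[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs**2) / np.sqrt(2.0 * np.pi)
    return kernel.mean(axis=1) / bandwidth


def sample_sparse_inputs(train_x, candidates, n_points, bandwidth):
    """Return the n_points candidates with the lowest estimated density.

    Hypothetical selection rule: the excerpt does not say how sparse
    regions are turned into concrete input points.
    """
    density = gaussian_kde_1d(train_x, candidates, bandwidth)
    order = np.argsort(density)
    return candidates[order[:n_points]]


rng = np.random.default_rng(0)

# Toy training data confined to [0, 1] (assumed, for illustration only).
train_x = rng.uniform(0.0, 1.0, size=200)
train_y = 2.0 * train_x + 1.0 + rng.normal(0.0, 0.05, size=200)

# Teacher: a linear least-squares fit, standing in for the NN/RF/GP teachers.
teacher_coeffs = np.polyfit(train_x, train_y, deg=1)

# Candidate inputs cover a wider range than the training data.
candidates = np.linspace(-1.0, 2.0, 301)
synthetic_x = sample_sparse_inputs(train_x, candidates, n_points=50, bandwidth=0.1)

# Knowledge distillation step: the teacher labels the new inputs.
synthetic_y = np.polyval(teacher_coeffs, synthetic_x)

# Student: a cubic fit (a stand-in for GP) trained on real plus synthetic data.
aug_x = np.concatenate([train_x, synthetic_x])
aug_y = np.concatenate([train_y, synthetic_y])
student_coeffs = np.polyfit(aug_x, aug_y, deg=3)
```

In this toy setup the selected points all fall outside the training range, so the student is constrained by the teacher's predictions exactly where it would otherwise be free to diverge. The quality of that constraint depends entirely on how well the teacher itself extrapolates.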

Taxonomy

Topics: Evolutionary Algorithms and Applications · Machine Learning and Data Classification · Machine Learning in Materials Science