Data Aggregation for Reducing Training Data in Symbolic Regression
Lukas Kammerer, Gabriel Kronberger, Michael Kommenda

TL;DR
This paper explores data aggregation techniques like k-means clustering and binning to reduce training data size in symbolic regression, aiming to improve computational efficiency while maintaining accuracy.
Contribution
It introduces data aggregation methods for symbolic regression, demonstrating their effectiveness in reducing training data and runtime with minimal accuracy loss.
Findings
K-means and random sampling maintain accuracy with only 30% of the data.
Speed-up is proportional to data reduction.
Binning causes high test error.
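To make the two simpler reduction methods concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `random_sample` and `bin_aggregate` are hypothetical, and the binning scheme (equal-width bins over a single feature, with each bin replaced by the mean of its rows) is an assumption, since the summary does not say how the bins are formed.

```python
import random
from collections import defaultdict


def random_sample(rows, fraction, seed=0):
    """Keep a random subset of the rows (hypothetical helper)."""
    rng = random.Random(seed)
    n = max(1, round(len(rows) * fraction))
    return rng.sample(rows, n)


def bin_aggregate(rows, n_bins, feature=0):
    """Replace the rows in each equal-width bin by their column-wise mean.

    Assumption: bins are formed over one feature column only.
    """
    values = [r[feature] for r in rows]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    bins = defaultdict(list)
    for r in rows:
        # Clamp so the maximum value falls into the last bin.
        idx = min(int((r[feature] - lo) / width), n_bins - 1)
        bins[idx].append(r)
    # One averaged row per non-empty bin, ordered by bin index.
    return [
        tuple(sum(col) / len(members) for col in zip(*members))
        for _, members in sorted(bins.items())
    ]


# Toy data: (feature, target) pairs on the line y = 2x + 1.
data = [(float(x), 2.0 * x + 1.0) for x in range(100)]
sampled = random_sample(data, 0.3)   # 30 of the original rows
binned = bin_aggregate(data, 10)     # at most 10 averaged rows
```

Random sampling keeps original observations unchanged, while binning returns synthetic averaged rows, so any detail inside a bin is lost.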
Abstract
The growing volume of data makes the use of computationally intense machine learning techniques such as symbolic regression with genetic programming more and more impractical. This work discusses methods to reduce the training data and thereby also the runtime of genetic programming. The data is aggregated in a preprocessing step before running the actual machine learning algorithm. K-means clustering and data binning are used for data aggregation and compared with random sampling as the simplest data reduction method. We analyze the achieved speed-up in training and the effects on the trained models' test accuracy for every method on four real-world data sets. The performance of genetic programming is compared with random forests and linear regression. It is shown that k-means and random sampling lead to very small loss in test accuracy when the data is reduced down to only 30% of the…
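The k-means aggregation step described in the abstract could look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's code: the function name `kmeans_aggregate` is hypothetical, features and target are clustered jointly, and the resulting centroids are used as the reduced training set. The clustering itself is the standard Lloyd's algorithm, written with the standard library only.

```python
import random


def kmeans_aggregate(rows, k, iterations=20, seed=0):
    """Reduce `rows` (equal-length numeric tuples) to k centroids.

    Assumption: each row holds the features and the target together,
    and the centroids replace the original rows as training data.
    """
    rng = random.Random(seed)
    # Initialise centroids with k distinct rows of the data.
    centroids = [list(r) for r in rng.sample(rows, k)]
    for _ in range(iterations):
        # Assignment step: attach every row to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for row in rows:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(row, centroids[j])),
            )
            clusters[nearest].append(row)
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return [tuple(c) for c in centroids]


# Toy data: (feature, target) pairs on the line y = 2x + 1,
# reduced to 30% of the original size.
data = [(float(x), 2.0 * x + 1.0) for x in range(100)]
reduced = kmeans_aggregate(data, k=30)
```

The reduced set would then be passed to the learner in place of the full data. Because each evaluation touches only `k` rows instead of all of them, the training cost shrinks with the size of the reduced set.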
Taxonomy
Methods: k-Means Clustering
