Data-Efficient Symbolic Regression via Foundation Model Distillation

Wangyang Ying; Jinghan Zhang; Haoyue Bai; Nanxu Gong; Xinyuan Wang; Kunpeng Liu; Chandan K. Reddy; Yanjie Fu

arXiv:2508.19487 · cs.LG · August 28, 2025

TL;DR

EQUATE is a novel framework that fine-tunes foundation models for symbolic regression in low-data scenarios by combining symbolic-numeric alignment with embedding optimization, leading to improved accuracy and robustness.

Contribution

We introduce EQUATE, a data-efficient fine-tuning method that reformulates symbolic search as a continuous optimization in embedding space, enhancing symbolic regression from limited data.

Findings

01 EQUATE outperforms state-of-the-art baselines in accuracy and robustness.

02 EQUATE maintains low complexity and enables fast inference.

03 EQUATE demonstrates effectiveness across multiple benchmark datasets.

Abstract

Discovering interpretable mathematical equations from observed data (a.k.a. equation discovery or symbolic regression) is a cornerstone of scientific discovery, enabling transparent modeling of physical, biological, and economic systems. While foundation models pre-trained on large-scale equation datasets offer a promising starting point, they often suffer from negative transfer and poor generalization when applied to small, domain-specific datasets. In this paper, we introduce EQUATE (Equation Generation via QUality-Aligned Transfer Embeddings), a data-efficient fine-tuning framework that adapts foundation models for symbolic equation discovery in low-data regimes via distillation. EQUATE combines symbolic-numeric alignment with evaluator-guided embedding optimization, enabling a principled embedding-search-generation paradigm. Our approach reformulates discrete equation search as a…
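To make the embedding-search-generation idea concrete, here is a minimal toy sketch of searching in a continuous embedding space instead of over discrete equations. It is not the EQUATE implementation. Everything in it is an assumption for illustration: the four-equation candidate pool, the hand-assigned 2-D embeddings (standing in for a learned encoder), the kernel-weighted surrogate (standing in for a learned evaluator), and nearest-neighbor decoding (standing in for a generative decoder).

```python
import math

# Hypothetical candidate pool: name -> (callable, hand-assigned 2-D embedding).
# Assumption: a real system would obtain embeddings from a learned encoder.
CANDIDATES = {
    "x": (lambda x: x, (0.0, 0.0)),
    "x**2": (lambda x: x * x, (1.0, 0.0)),
    "sin(x)": (lambda x: math.sin(x), (0.0, 1.0)),
    "x**2 + sin(x)": (lambda x: x * x + math.sin(x), (1.0, 1.0)),
}


def mse(f, data):
    """Mean squared error of equation f on (x, y) pairs."""
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)


def make_evaluator(data):
    """Build a smooth surrogate score over embedding space (higher is better)."""
    scored = [(emb, -mse(f, data)) for f, emb in CANDIDATES.values()]

    def evaluator(e):
        # Kernel-weighted average of the candidate scores around point e.
        weights = [
            math.exp(-((e[0] - c[0]) ** 2 + (e[1] - c[1]) ** 2))
            for c, _ in scored
        ]
        return sum(w * s for w, (_, s) in zip(weights, scored)) / sum(weights)

    return evaluator


def search_and_decode(data, start=(0.0, 0.0), lr=0.1, steps=100, eps=1e-5):
    """Gradient-ascend the surrogate from `start`, then decode to the nearest candidate."""
    evaluator = make_evaluator(data)
    e = list(start)
    for _ in range(steps):
        grad = []
        for i in range(2):
            # Central finite difference along coordinate i.
            hi, lo = list(e), list(e)
            hi[i] += eps
            lo[i] -= eps
            grad.append((evaluator(hi) - evaluator(lo)) / (2 * eps))
        e = [e[i] + lr * grad[i] for i in range(2)]
    # "Generation" step, reduced here to a nearest-neighbor lookup.
    name = min(CANDIDATES, key=lambda n: math.dist(e, CANDIDATES[n][1]))
    return name, tuple(e)


if __name__ == "__main__":
    # Five noiseless samples from y = x**2 + sin(x), a deliberately tiny dataset.
    data = [(x, x * x + math.sin(x)) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
    print(search_and_decode(data))
```

The search starts at the embedding of `x` and follows the surrogate's gradient, so the discrete choice among equations is only made once, at decoding time. The low-data setting is mimicked by scoring candidates on just five points.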
