TL;DR
This paper introduces SymbArena, a large dataset and benchmark for fine-tuning LLMs in symbolic regression, demonstrating that fine-tuned LLMs can outperform traditional methods in accuracy and form consistency.
Contribution
It presents SymbArena, a large-scale dataset and benchmark for SR, and shows that fine-tuning LLMs significantly improves their symbolic regression performance.
Findings
Fine-tuned LLMs outperform traditional numerical methods in accuracy.
SymbArena enables effective training and evaluation of LLMs for SR.
Symbolic-R1 surpasses previous LLM baselines in form accuracy and numerical precision.
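The findings name two kinds of measure: numerical precision and form accuracy. The sketch below is one way to picture the difference, not the paper's evaluation code. It assumes numerical precision is scored with R² on sampled points and that form is compared by replacing numeric constants with a placeholder and matching the resulting expression trees. The helper names (`evaluate`, `r2_score`, `same_form`), the single variable `x`, and the function whitelist are all hypothetical.

```python
import ast
import math

# Assumed whitelist of functions an expression may call (illustrative only).
SAFE_FUNCS = {"sin": math.sin, "cos": math.cos, "exp": math.exp,
              "log": math.log, "sqrt": math.sqrt}


def evaluate(expr, xs):
    """Evaluate a single-variable expression string at each point in xs."""
    code = compile(ast.parse(expr, mode="eval"), "<expr>", "eval")
    return [eval(code, {"__builtins__": {}}, {**SAFE_FUNCS, "x": x}) for x in xs]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


class _Skeleton(ast.NodeTransformer):
    """Replace every literal constant with the placeholder name C."""

    def visit_Constant(self, node):
        return ast.copy_location(ast.Name(id="C", ctx=ast.Load()), node)


def same_form(expr_a, expr_b):
    """True when both expressions share a tree once constants are abstracted.

    Deliberately crude: it does not treat x*2 and 2*x as equal, and it does
    not fold a leading minus sign into a constant.
    """
    def skeleton(expr):
        return ast.dump(_Skeleton().visit(ast.parse(expr, mode="eval")))
    return skeleton(expr_a) == skeleton(expr_b)


xs = [0.0, 0.5, 1.0, 1.5, 2.0]
truth = "2.0 * sin(x) + 1.0"
candidate = "1.8 * sin(x) + 1.1"   # right form, slightly wrong constants
y_true = evaluate(truth, xs)
print(same_form(truth, candidate), r2_score(y_true, evaluate(candidate, xs)))
```

A candidate can score well on one measure and poorly on the other, which is why reporting both is informative: a high-degree polynomial may fit the samples closely while having the wrong form, and the right form with poorly fitted constants may have a low R².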
Abstract
Deriving governing equations from observational data, known as Symbolic Regression (SR), is a cornerstone of scientific discovery. Large Language Models (LLMs) have shown promise in this task by leveraging their vast cross-disciplinary scientific knowledge. However, existing LLM-based methods primarily rely on direct inference or prompt engineering, often requiring excessive inference iterations to converge on correct formulas or failing to handle complex target equations. These limitations in effectiveness and generalization stem from an inherent tension between pre-trained LLMs' proficiency in approximate reasoning and the high-precision demands of SR tasks. To bridge this gap, we propose to fine-tune LLMs for enhanced SR capability. Yet, the absence of dedicated datasets for SR-oriented fine-tuning remains a critical barrier. We thus introduce SymbArena, specifically engineered to…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper is generally well-written and easy to follow.
- Figures and visual explanations (especially Figure 3) clearly present the overall pipeline and make the methodology understandable.
- Enhancing LLM backbones for symbolic regression through synthetic fine-tuning is well-motivated and interesting.
- What test data is used for the results in Tables 2 and 3? If the evaluations are conducted only on SymbArena, even with a held-out split, testing on data from the same generation distribution does not convincingly demonstrate generalization or fair comparison to other baselines. I would like to see evaluation of the fine-tuned model on LLM-SRBench [1], which provides a more diverse and domain-general benchmark for LLM-based equation discovery.
- Since LLM-SRBench also reports symbolic accurac
+ The dataset scale is impressive and could become a useful SR resource.
+ The dual metric design (numeric + structural) addresses a genuine limitation of past SR benchmarks.
+ The overall pipeline is ambitious and well-motivated from an empirical standpoint.
- The core idea (IFT + GRPO + iterative refinement) is a straightforward combination of existing methods. I have serious concerns about the lack of algorithmic innovation.
- Synthetic-only evaluation. All experiments are on a self-generated dataset; no validation on SRBench, Nguyen, or real scientific equations. Claims of generalization are unsupported.
- The GPT-4o adjudicator is non-reproducible. Using a closed-source model to score results is scientifically unsound.
- The “reality enhancement
* The core idea of fine-tuning LLMs specifically for symbolic regression is interesting and relatively underexplored compared to inference-time scaling approaches.
* The GRPO training scheme with multiple reward types is a clever design that tries to balance structural correctness with numerical accuracy.
* The reality-verification step for test equations (using an LLM to check similarity to known physics) to ensure practical relevance of the benchmark seems novel and promising.
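To make the mention of GRPO with multiple reward types concrete, here is a minimal sketch of the group-relative advantage at the core of GRPO: each sampled candidate's reward is normalised by the mean and standard deviation of its own group. The paper's actual reward terms and weights are not given in this excerpt. The two-term blend in `combined_reward`, its equal weights, and the choice of population standard deviation are assumptions made for illustration.

```python
import statistics


def combined_reward(form_score, numeric_score, w_form=0.5, w_numeric=0.5):
    """Blend a structural score and a numerical score (weights are assumed)."""
    return w_form * form_score + w_numeric * numeric_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of samples for the same prompt.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    Population std is used here; implementations differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical candidate equations for one dataset:
# (form score in {0, 1}, numeric score in [0, 1]).
candidates = [(1.0, 0.9), (0.0, 0.95), (1.0, 0.2), (0.0, 0.1)]
rewards = [combined_reward(f, n) for f, n in candidates]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because the advantage is relative to the group, a candidate is reinforced only when it beats its siblings, and a group whose candidates all receive the same reward contributes no learning signal.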
**Problem formulation.** The paper focuses on traditional SR without relying on domain knowledge (finding equations from data only) but the positioning is confusing. LLM-SR and SGA are designed for context-rich problems and seem tested outside their scope here. If we consider the contribution as "fine-tuning" the model for SR, one naturally thinks of transformer-based methods like E2E which, while trained on much more data, struggle with symbolic recovery. So it is not clear if the advantage her
- Exploring LLMs for symbolic regression is an interesting topic, as LLMs in theory allow textual and numerical information to be combined.
- Their ablation study on the different training stages and inference strategies is extensive.
- Their method is straightforward and easy to understand.
- The main weakness is the reliance on the Symbolic-R1 test set for the main results. As this set shares the same distribution as the training data, improved performance is expected. This leaves unanswered the core question of whether performance generalizes to other distributions.
- While the paper presents the new dataset as a contribution, the data creation pipeline itself is not novel. Consequently, the paper's primary contribution is the multi-step strategy of fine-tuning the LLM on this
