Can Test-time Computation Mitigate Reproduction Bias in Neural Symbolic Regression?
Shun Sato, Issei Sato

TL;DR
This paper investigates the limitations of neural symbolic regression, particularly the impact of reproduction bias and token generation issues, and demonstrates that test-time strategies can mitigate these biases to improve model robustness.
Contribution
The study provides a theoretical and empirical analysis of neural symbolic regression, identifying reproduction bias as a key limitation and proposing test-time strategies to reduce it.
Findings
Token-by-token generation is ineffective for NSR.
Reproduction bias restricts the search space in NSR.
Test-time strategies can mitigate reproduction bias.
Abstract
Mathematical expressions play a central role in scientific discovery. Symbolic regression aims to automatically discover such expressions from given numerical data. Recently, neural symbolic regression (NSR) methods that involve Transformers pre-trained on synthetic datasets have gained attention for their fast inference, but they often perform poorly, especially with many input variables. In this study, we analyze NSR from both theoretical and empirical perspectives and show that (1) ordinary token-by-token generation is ill-suited for NSR, as Transformers cannot compositionally generate tokens while validating numerical consistency, and (2) the search space of NSR methods is greatly restricted due to reproduction bias, where the majority of generated expressions are merely copied from the training data. We further examine whether tailored test-time strategies can reduce reproduction…
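To make the notion of "copied from the training data" concrete, the following is a minimal sketch of how a reproduction rate could be measured. It is not the authors' implementation: the helper names `skeleton` and `reproduction_rate`, the choice to mask every numeric constant (including integer exponents), and the toy expressions are all assumptions made for illustration.

```python
import ast


def skeleton(expr: str) -> str:
    """Return a structural fingerprint of an expression string.

    Assumption: two expressions share a skeleton when their syntax trees
    match after every numeric constant is replaced by a placeholder.
    """
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant):
            node.value = "C"  # mask the constant's value
    return ast.dump(tree)


def reproduction_rate(generated, training):
    """Fraction of generated expressions whose skeleton occurs in training."""
    seen = {skeleton(e) for e in training}
    hits = sum(skeleton(e) in seen for e in generated)
    return hits / len(generated)


# Hypothetical toy data, not taken from the paper.
training = ["x1 + 2.0*x2", "sin(x1)*x2"]
generated = ["x1 + 3.5*x2", "sin(x1)*x2", "x1*x1 + x2"]

rate = reproduction_rate(generated, training)
print(f"reproduction rate: {rate:.2f}")
```

In this toy setting the first two generated expressions reuse a training skeleton and the third does not. A rate near 1 on real model outputs would indicate that generation rarely leaves the set of structures seen during pre-training.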
Peer Reviews
Decision · Submitted to ICLR 2026
- The identified reproduction bias is significant and is analyzed soundly from both theoretical and empirical perspectives.
- While the paper explores different strategies and approaches, no proposed solution, as acknowledged by the authors in the conclusions, seems to both decrease the reproduction bias and increase accuracy.
- Experiments are carried out with a maximum dataset size of 1.5M equations, whereas established work in the field trains these models with up to 100M. Hence, the conclusions might be biased by this difference in scale.
- The connection between the theoretical and empirical evaluation settings i
**Identification and Empirical Demonstration of a Critical Issue**: The paper's greatest strength is its clear identification and systematic empirical demonstration of reproduction bias in NSR. The experimental design, using "not_included" and "baseline" datasets, effectively isolates the problem and shows its severe impact on generalization.
**Rigorous Exploration of Test-Time Strategies**: The paper goes beyond merely identifying the problem by rigorously evaluating multiple test-time com
**Fundamental Flaw in the Definition and Evidence for Reproduction Bias**: The central concept of "reproduction bias" is potentially flawed. The criterion for bias, namely whether a generated expression tree structure is present in the training data, does not account for symbolic equivalence. Many expressions are functionally identical (e.g., x1*(x1+x2) and x1**2 + x1*x2). Therefore, a model generating an equivalent but syntactically different expression should be considered a success, not evidence of
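The reviewer's example can be checked directly. The sketch below contrasts a purely structural comparison with a numerical one on the two expressions quoted above. It is illustrative only: the function names, the sampling range, and the tolerance are assumptions, and agreement at random points is evidence of equivalence rather than a proof (a computer algebra system would be needed for an exact check).

```python
import ast
import math
import random


def same_tree(a: str, b: str) -> bool:
    """Structural test: do the two strings parse to identical syntax trees?"""
    dump = lambda s: ast.dump(ast.parse(s, mode="eval"))
    return dump(a) == dump(b)


def numerically_equivalent(a, b, variables, trials=100, tol=1e-9, seed=0):
    """Numerical test: do both expressions agree at random sample points?

    Assumption: inputs are drawn uniformly from [-5, 5]; the expressions
    use only arithmetic operators, so no math functions are exposed.
    """
    rng = random.Random(seed)
    code_a = compile(a, "<expr>", "eval")
    code_b = compile(b, "<expr>", "eval")
    for _ in range(trials):
        point = {v: rng.uniform(-5.0, 5.0) for v in variables}
        va = eval(code_a, {"__builtins__": {}}, point)
        vb = eval(code_b, {"__builtins__": {}}, point)
        if not math.isclose(va, vb, rel_tol=tol, abs_tol=tol):
            return False
    return True


factored = "x1*(x1+x2)"
expanded = "x1**2 + x1*x2"

print(same_tree(factored, expanded))                             # False
print(numerically_equivalent(factored, expanded, ["x1", "x2"]))  # True
```

A tree-membership criterion would count the expanded form as novel even when the factored form is in the training set, which is the gap the reviewer points to.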
The main strength is that it explicitly demonstrates memorization behavior in neural symbolic regression.
The weakness is that these insights are not new in 2025, as several recent studies have reached similar conclusions.
Taxonomy
Topics: Machine Learning in Materials Science · Evolutionary Algorithms and Applications · Generative Adversarial Networks and Image Synthesis
