# Knowledge integration for physics-informed symbolic regression using pre-trained large language models

**Authors:** Bilge Taskin, Wenxiong Xie, Teddy Lazebnik

PMC · DOI: 10.1038/s41598-026-35327-6 · Scientific Reports · 2026-01-13

## TL;DR

This paper uses pre-trained large language models to integrate domain knowledge into symbolic regression: the LLM evaluates each candidate equation and that evaluation enters the loss function, reducing the need for manual, expert-driven formulations.

## Contribution

The approach adds a term to the symbolic regression loss function that carries an LLM's evaluation of each candidate equation, automating the incorporation of domain knowledge.

## Key findings

- LLM integration consistently improves the reconstruction of physical dynamics from data.
- More informative prompts significantly enhance the performance of the method.
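- The approach works across three SR algorithms (DEAP, gplearn, and PySR), three pre-trained LLMs (Falcon, Mistral, and Llama 2), and three physical systems (dropping ball, simple harmonic motion, and electromagnetic wave).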

## Abstract

Symbolic regression (SR) has emerged as a powerful tool for automated scientific discovery, enabling the derivation of governing equations from experimental data. A growing body of work illustrates the promise of integrating domain knowledge into SR to improve the generality and usefulness of the discovered equations. Physics-informed SR (PiSR) addresses this by incorporating domain knowledge, but current methods often require specialized formulations and manual feature engineering, which limits their use to domain experts. In this study, we leverage pre-trained Large Language Models (LLMs) to facilitate knowledge integration in PiSR. By harnessing the contextual understanding of LLMs trained on vast scientific literature, we aim to automate the incorporation of domain knowledge, reducing the need for manual intervention and making the process accessible to a broader range of scientific problems. Specifically, the LLM is integrated into the SR loss function by adding a term that reflects the LLM’s evaluation of the equation produced by SR. We extensively evaluate our method using three SR algorithms (DEAP, gplearn, and PySR) and three pre-trained LLMs (Falcon, Mistral, and Llama 2) across three physical dynamics (dropping ball, simple harmonic motion, and electromagnetic wave). The results demonstrate that LLM integration consistently improves the reconstruction of physical dynamics from data, enhancing the robustness of SR models to noise and complexity. We further explore the impact of prompt engineering, finding that more informative prompts significantly improve performance.
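The loss construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the additive form, the `weight` parameter, the `combined_loss` name, and the `stub_llm_score` function are all assumptions made for illustration. In particular, `stub_llm_score` is a hard-coded placeholder standing in for a real LLM call that would rate how physically plausible an equation is on a scale from 0 to 1.

```python
from typing import Callable, Sequence


def mse(predict: Callable[[float], float],
        xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean squared error of a candidate equation on the data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def stub_llm_score(equation_text: str) -> float:
    """Hypothetical placeholder for an LLM evaluation in [0, 1].

    A real system would prompt an LLM with the equation and a description
    of the physical system. Here, as an assumption for the demo, any
    equation quadratic in t counts as plausible for a dropping ball.
    """
    return 1.0 if "t**2" in equation_text else 0.0


def combined_loss(predict: Callable[[float], float], equation_text: str,
                  xs: Sequence[float], ys: Sequence[float],
                  llm_score: Callable[[str], float],
                  weight: float = 0.5) -> float:
    """Data-fit term plus an assumed LLM-based penalty term."""
    data_term = mse(predict, xs, ys)
    knowledge_term = 1.0 - llm_score(equation_text)  # 0 when fully plausible
    return data_term + weight * knowledge_term


# Noise-free dropping-ball data: h(t) = 20 - 0.5 * 9.81 * t^2
ts = [0.0, 0.5, 1.0, 1.5]
hs = [20 - 4.905 * t**2 for t in ts]

# Two hypothetical candidates an SR algorithm might propose.
quadratic_loss = combined_loss(lambda t: 20 - 4.905 * t**2,
                               "20 - 4.905*t**2", ts, hs, stub_llm_score)
linear_loss = combined_loss(lambda t: 20 - 7.0 * t,
                            "20 - 7.0*t", ts, hs, stub_llm_score)

print(f"quadratic candidate loss: {quadratic_loss:.4f}")
print(f"linear candidate loss:    {linear_loss:.4f}")
```

In this toy setup the quadratic candidate is favored on both terms. The knowledge term matters most when noise makes the data term alone ambiguous between candidates, which is the situation the abstract's robustness claim addresses.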

The online version contains supplementary material available at 10.1038/s41598-026-35327-6.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12800073/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12800073/full.md

## References

43 references — full list in the complete paper: https://tomesphere.com/paper/PMC12800073/full.md

---
Source: https://tomesphere.com/paper/PMC12800073