TL;DR
Lean4Physics introduces a comprehensive framework and benchmark for formal reasoning in college-level physics within Lean4, highlighting the challenges and potential for future AI-assisted physics reasoning.
Contribution
It is the first to provide a physics reasoning benchmark in Lean4, including a new library and baseline results with expert and state-of-the-art models.
Findings
Baseline models achieve low accuracy (16-35%)
PhysLib improves model performance by 11.75% on average
Lean4PHYS is the first formal physics benchmark in Lean4
Abstract
We present **Lean4PHYS**, a comprehensive reasoning framework for college-level physics problems in Lean4. **Lean4PHYS** includes *LeanPhysBench*, a college-level benchmark for formal physics reasoning in Lean4, which contains 200 hand-crafted and peer-reviewed statements derived from university textbooks and physics competition problems. To establish a solid foundation for formal reasoning in physics, we also introduce *PhysLib*, a community-driven repository containing fundamental unit systems and theorems essential for formal physics reasoning. Based on the benchmark and Lean4 repository we composed in **Lean4PHYS**, we report baseline results using major expert Math Lean4 provers and state-of-the-art closed-source models, with the best performance of DeepSeek-Prover-V2-7B achieving only 16% and Claude-Sonnet-4 achieving 35%. We also conduct a detailed analysis showing that our…
Peer Reviews
Decision·ICLR 2026 ConditionalPoster
The first framework for physics problems in Lean4 for LLM. This may be of significant interest to the community that studies LLM applications to physics. The PhysLib library and LeanPhysBench test dataset can be a useful artefact for future studies.
- Copyright infringement. - **Due to this issue, I decide to assign a very low score to the paper albeit the important contribution. However, I am very open to changing my score once this issue is clarified/addressed.** - The authors mentioned that “rather than copying the questions verbatim, we reformulated and rephrased them based on the underlying physics ideas", however, I am not certain that this is sufficient. The key issue hinges on "substantial similarity" and whether the origina
* **Significant Contribution to the Research Community:** This paper contributes two extremely valuable resources: **PhysLib**, a modular and extensible foundational library for physics, and **LeanPhysBench**, the first benchmark dedicated to evaluating formal reasoning capabilities in physics. These two achievements provide a solid infrastructure and a fair evaluation standard for subsequent researchers to enter and work in this field, which will undoubtedly promote the development of the ent
* **Insufficient Discussion of Related Work:** The "Related Work" section mentions Lean's application in physics and other non-mathematical fields but fails to deeply discuss the differences and connections with some directly related works. For example, the paper mentions the `PhysLean` [Tooby-Smith & contributors (2024)] project but only briefly describes it as "theorem-specific, small-scale, and non-modular." Considering that `PhysLean` also aims to formalize physics in Lean4, the authors sh
- The paper pioneers formal physics reasoning in Lean and introduces the first large-scale Lean4 physics benchmark covering topics from mechanics to modern physics. - PhysLib provides a modular, SI-based unit system and topic-structured theorems, ensuring dimensional consistency and extendability for accurate physical reasoning. - Comprehensive experiments show that PhysLib context consistently boosts model performance. LLMs outperform specialized Lean provers, revealing that current Lean pr
- The paper lacks an explanation of the advantages of PhysLib's modular structure over organizing theorems by specific physics domains. It also does not clarify how this hierarchical organization aids in retrieval and reasoning. - What defines the boundary between mathematical and physical problems, and why do Lean provers, which perform well in mathematics, fail to transfer their capabilities to the physics domain? - Would including non-competition problems, such as physics questions from midd
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
