Test-time Scaling Techniques in Theoretical Physics -- A Comparison of Methods on the TPBench Dataset

Zhiqi Gao; Tianyi Li; Yurii Kvasiuk; Sai Chaitanya Tadepalli; Maja Rudolph; Daniel J.H. Chung; Frederic Sala; Moritz Münchmeyer

arXiv:2506.20729·cs.LG·June 27, 2025

TL;DR

This paper evaluates and compares test-time scaling methods for large language models in advanced theoretical physics, introducing a symbolic weak-verifier framework that improves performance on physics and mathematical reasoning benchmarks.

Contribution

The paper introduces a novel symbolic weak-verifier framework that improves parallel test-time scaling on physics reasoning tasks. The method outperforms existing test-time scaling approaches on TPBench and is also effective on AIME.

Findings

01. The symbolic weak-verifier significantly outperforms existing methods on TPBench.

02. Test-time scaling methods can generalize from mathematical to physics reasoning.

03. The proposed approach improves parallel scaling results in complex scientific problems.

Abstract

Large language models (LLMs) have shown strong capabilities in complex reasoning, and test-time scaling techniques can enhance their performance at comparatively low cost. Many of these methods have been developed and evaluated on mathematical reasoning benchmarks such as AIME. This paper investigates whether the lessons learned from these benchmarks generalize to the domain of advanced theoretical physics. We evaluate a range of common test-time scaling methods on the TPBench physics dataset and compare their effectiveness with results on AIME. To better leverage the structure of physics problems, we develop a novel symbolic weak-verifier framework to improve parallel scaling results. Our empirical results demonstrate that this method significantly outperforms existing test-time scaling approaches on TPBench. We also evaluate our method on AIME, confirming its effectiveness in solving…
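As a rough illustration of what parallel test-time scaling with answer grouping can look like, the sketch below samples nothing itself: it takes a list of candidate final answers, groups those that agree, and returns the largest group's representative (a majority vote). It is not the paper's weak-verifier, whose details the abstract does not give. Agreement here is checked numerically at random points, as a simple stand-in for a symbolic equivalence check, and all names (`evaluate`, `agree`, `majority_answer`) and the choice of answers as Python expressions in `x` are assumptions made for this example.

```python
import math
import random

# Hypothetical setup: each candidate answer is a Python expression in x.
# eval() is used only for brevity; it is not safe on untrusted input.
ALLOWED = {"sin": math.sin, "cos": math.cos, "exp": math.exp,
           "sqrt": math.sqrt, "pi": math.pi}


def evaluate(expr: str, x: float) -> float:
    """Evaluate one candidate expression at the point x."""
    return eval(expr, {"__builtins__": {}}, {**ALLOWED, "x": x})


def agree(a: str, b: str, points, tol: float = 1e-9) -> bool:
    """Return True if a and b match numerically at every sample point."""
    try:
        return all(
            math.isclose(evaluate(a, x), evaluate(b, x),
                         rel_tol=tol, abs_tol=tol)
            for x in points
        )
    except Exception:
        # A candidate that fails to evaluate agrees with nothing.
        return False


def majority_answer(candidates, n_points: int = 5, seed: int = 0) -> str:
    """Group agreeing candidates and return the largest group's first member."""
    rng = random.Random(seed)
    points = [rng.uniform(0.5, 2.0) for _ in range(n_points)]
    classes = []  # each entry is [representative, count]
    for cand in candidates:
        for cls in classes:
            if agree(cls[0], cand, points):
                cls[1] += 1
                break
        else:
            classes.append([cand, 1])
    return max(classes, key=lambda c: c[1])[0]


samples = ["2*sin(x)*cos(x)", "sin(2*x)", "sin(x)**2", "sin(x)*cos(x)*2"]
print(majority_answer(samples))
```

Grouping by equivalence rather than by exact string match matters for physics-style answers, where the same expression can be written in many forms; a plain string vote would split the three equivalent candidates above into separate groups.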

Taxonomy

Topics: Neural Networks and Applications · Time Series Analysis and Forecasting · Advanced Text Analysis Techniques