TL;DR
This paper evaluates advanced reasoning LLMs on complex physics problems, demonstrating state-of-the-art accuracy and the importance of symbolic reasoning and few-shot prompting for improved performance.
Contribution
It provides a comprehensive analysis of reasoning models such as Deepseek-R1 on physics problems, highlighting their symbolic derivation capabilities and the benefits of few-shot prompting.
Findings
Achieved state-of-the-art accuracy on SciBench physics problems.
Models generate distinctive symbolic derivation reasoning patterns.
Few-shot prompting further improves model accuracy.
Abstract
Navigating the complexities of physics reasoning has long been a difficult task for Large Language Models (LLMs), requiring a synthesis of profound conceptual understanding and adept problem-solving techniques. In this study, we investigate the application of advanced instruction-tuned reasoning models, such as Deepseek-R1, to address a diverse spectrum of physics problems curated from the challenging SciBench benchmark. Our comprehensive experimental evaluation reveals the remarkable capabilities of reasoning models. Not only do they achieve state-of-the-art accuracy in answering intricate physics questions, but they also generate distinctive reasoning patterns that emphasize symbolic derivation. Furthermore, our findings indicate that even for these highly sophisticated reasoning models, the strategic incorporation of few-shot prompting can still yield measurable improvements in…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Most existing studies on reasoning LLMs concentrate on mathematics, logic puzzles, or code synthesis. This paper focuses on physics, thereby broadening the empirical scope of reasoning model evaluation. 2. The evaluation pipeline (zero-shot vs. few-shot CoT) is well described, using publicly available datasets (SciBench). The inclusion of multiple Deepseek variants and comparison with baselines like GPT-4-Turbo adds breadth. 3. The paper includes appendices with prompt templates, paramete
1. This is primarily an evaluation work and lacks novelty. The authors evaluate pre-existing reasoning models (Deepseek-R1 and distill variants) on a known benchmark (SciBench). 2. The “symbolic vs. numeric” observation, while intuitive, is anecdotal and not systematically analyzed or quantified. 3. The evaluation is restricted to SciBench and unimodal, text-only physics questions, excluding diagrams, visual reasoning, or multimodal tasks that are central to real-world physics understanding.
The paper is clearly written and the experimental setup is well explained. The authors describe datasets, prompting methods, and evaluation criteria in detail, which makes the work reproducible. The focus on symbolic versus numerical reasoning is conceptually interesting and could inspire more interpretable studies of model reasoning behavior. The inclusion of both large and distilled models also gives a decent view of how scaling and compression affect reasoning. Visuals and tables are easy to
- Limited novelty. The work mostly reuses existing datasets and prompting methods without introducing a clear methodological contribution. The idea of comparing symbolic and numerical reasoning is interesting but remains descriptive. There is no formal way to define or quantify what counts as “symbolic reasoning,” which weakens the main claim. - Lack of deeper insight. The analysis of results is mainly observational. The authors describe trends but do not provide clear explanations or theoretica
1. Novel empirical observation: the study surfaces a clear behavioral phenomenon—reasoning-oriented models tend to adopt symbolic derivation in physics problem solving, which may be directly linked to better performance; few-shot CoT continues to provide gains. 2. Clear research question and reasonably described setup: the work explicitly focuses on “how LLMs solve physics problems,” differentiating zero-shot vs. few-shot CoT and paying attention to intermediate reasoning chains. 3. Goes beyon
1. Limited methodological innovation and over-reliance on prompting: the work primarily probes models via prompt variants (zero-/few-shot CoT) and descriptive frequency analyses, without introducing new algorithms, training procedures, tools, or a formal methodology. The conclusions are largely correlational and risk conflating correlation with causation. 2. Fairness of comparisons is compromised: the evaluated subset is filtered (Section 3.2), yet baseline numbers are taken from Chen et al. (2
I read this paper twice and cannot find any strengths.
1. Many Conceptual Errors: + I do not know why the authors call the reasoning models "instruction-tuned reasoning models." This is very unprofessional. They have many names, but the authors consistently write the wrong name throughout this paper. + Many conceptual errors in L145-L147. I cannot tell what the authors want to express. This sentence is clearly written by an LLM, possibly a very weak LLM. + In L43-44, "These recent models are specifically optimized through extensive instruction tunin
