Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning

Alan Li; Yixin Liu; Arpan Sarkar; Doug Downey; Arman Cohan

arXiv:2508.19202·cs.CL·January 21, 2026

TL;DR

This paper introduces new benchmarks and a probing framework to evaluate and analyze how large language models perform scientific reasoning, emphasizing the roles of knowledge retrieval and reasoning enhancement.

Contribution

It presents SciReas and SciReas-Pro benchmarks for scientific reasoning, and KRUX, a framework to disentangle knowledge and reasoning roles in LLMs.
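The peer reviews on this page describe KRUX as fixing the knowledge input (the "Knowledge Ingredients", or KIs) and varying the target model. A minimal sketch of that kind of comparison follows. Everything in it is an assumption made for illustration: the prompt template, the function names (`build_prompt`, `accuracy`, `compare_models`), the two stub models, and the toy questions are hypothetical and are not taken from the paper or its code.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical interface: a "model" is any callable from prompt text to answer text.
Model = Callable[[str], str]


def build_prompt(question: str, kis: Optional[List[str]] = None) -> str:
    """Build a prompt, optionally prefixed with knowledge ingredients (KIs).

    The template is an assumption, not the paper's actual prompt.
    """
    if not kis:
        return f"Question: {question}\nAnswer:"
    facts = "\n".join(f"- {ki}" for ki in kis)
    return f"Relevant knowledge:\n{facts}\n\nQuestion: {question}\nAnswer:"


def accuracy(model: Model, examples: List[dict], use_kis: bool) -> float:
    """Exact-match accuracy of one model, with or without the fixed KIs."""
    correct = 0
    for ex in examples:
        prompt = build_prompt(ex["question"], ex["kis"] if use_kis else None)
        if model(prompt).strip() == ex["answer"]:
            correct += 1
    return correct / len(examples)


def compare_models(models: Dict[str, Model], examples: List[dict]) -> Dict[str, Dict[str, float]]:
    """Hold the KIs fixed per question and vary only the target model."""
    return {
        name: {
            "without_kis": accuracy(model, examples, use_kis=False),
            "with_kis": accuracy(model, examples, use_kis=True),
        }
        for name, model in models.items()
    }


# Toy stand-ins for real LLM calls (hypothetical, for demonstration only).
def stub_reads_context(prompt: str) -> str:
    # Answers only when knowledge is supplied: returns the last word of the first fact.
    if not prompt.startswith("Relevant knowledge:"):
        return "unknown"
    first_fact = prompt.splitlines()[1]
    return first_fact.rsplit(" ", 1)[-1]


def stub_partial_recall(prompt: str) -> str:
    # "Remembers" one fact on its own and ignores any supplied knowledge.
    return "6" if "carbon" in prompt else "unknown"


examples = [
    {
        "question": "What is the atomic number of carbon?",
        "kis": ["The atomic number of carbon is 6"],
        "answer": "6",
    },
    {
        "question": "At what Celsius temperature does water boil at 1 atm?",
        "kis": ["Water boils at 1 atm at a Celsius temperature of 100"],
        "answer": "100",
    },
]

results = compare_models(
    {"reads_context": stub_reads_context, "partial_recall": stub_partial_recall},
    examples,
)
```

Because every model sees the same KIs for a given question, any remaining gap between models in the `with_kis` column cannot be attributed to missing knowledge in this setup, which is the kind of separation the reviews attribute to the framework. In a real run the stubs would be replaced by calls to actual models.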

Findings

01

Retrieving task-relevant knowledge is a key bottleneck.

02

External knowledge improves reasoning performance.

03

Verbalized reasoning enhances knowledge surfaceability.

Abstract

Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in…

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01·Rating 6·Confidence 4

Strengths

- Quality: The experimental design in Section 4 is rigorous. The use of controlled supervised fine-tuning (SFT) on Math, STEM, and BOTH data subsets allows the authors to make more robust claims about the impact of different training data. The paper also reports the performance of two different model families (Qwen and Llama). The analysis in RQ3 also reliably distinguishes between acquiring new knowledge and better surfacing of existing knowledge (Tables 4 & 5).
- Originality: While the goal of dis

Weaknesses

- The definition of "Knowledge Ingredients" (KIs) is operational rather than formal. While the paper describes KIs as "essential atomic knowledge units" (line 346), this definition remains vague. In practice, a KI is simply the output of an extractor model (DeepSeek-R1) given a specific prompt (Figure 12). This makes the central concept of the KRUX framework dependent on the specific extractor model and prompt used, which could affect reproducibility and the generalizability of the fi

Reviewer 02·Rating 4·Confidence 4

Strengths

- The study of knowledge recall versus other reasoning effects is an interesting direction that points toward building models with more reliable reasoning ability.
- The results show the effectiveness of directly reinforcing the knowledge ingredients through verbalization.

Weaknesses

- The separation of datasets is unclear: the STEM split also contains math problems, which blurs the attribution of the models' improvements.
- The knowledge ingredients extracted from DeepSeek-R1 play a distillation-like role. Although supporting experiments show that directly applying those ingredients does not lead to reasoning gains, more evidence would be helpful to show how this information is used and takes effect.
- Although models tuned with math data demonstrate performance ga

Reviewer 03·Rating 8·Confidence 4

Strengths

1. The paper extracts knowledge from the reasoning trace of a reasoning model and performs fine-grained processing. This approach is simple yet effective.
2. The significance is that when external knowledge (KIs) is supplied, even non-reasoning models exhibit a remarkably large improvement (≥10%). Nevertheless, reasoning-enhanced models still outperform them when both are given the same KIs, indicating additive and complementary benefits. This highlights an important insight for system design

Weaknesses

1. The selection process of SCIREAS-PRO relies on proprietary models and may implicitly bundle other behaviors. How capable are open-source models in this aspect? Is this filtering method model-idiosyncratic?
2. How consistent are the KIs extracted across different models, and how diverse are they?

Reviewer 04·Rating 6·Confidence 2

Strengths

1. This work is the first to construct a unified, standardized, and reasoning-intensive evaluation suite for scientific reasoning, achieving high-quality filtering via a subtask-level exclusion protocol.
2. The KRUX framework is innovative: by fixing knowledge input (KIs) and varying the target model, it achieves controlled separation of knowledge and reasoning, avoiding spurious correlations. The KIs are extracted from real reasoning traces, ensuring ecological validity.
3. The experiments are

Weaknesses

The design of the KRUX framework and the phrasing of its experimental conclusions are somewhat ambiguous. The claim that “even when the KIs are already known by the base model, reasoning fine-tuning still improves performance” (lines 93–97) seems to lack supporting evidence of zero-shot recall of those KIs by the base model.

Code & Models

Datasets

yale-nlp/SciReas-Pro
dataset·25 dl
