Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains
Juncheng Wu, Sheng Liu, Haoqin Tu, Hang Yu, Xiaoke Huang, James Zou, Cihang Xie, Yuyin Zhou

TL;DR
This paper investigates how large language models reason across domains by analyzing their step-by-step thinking along two axes: knowledge correctness and reasoning quality. The analysis reveals domain-specific strengths and limitations of fine-tuning methods.
Contribution
It introduces a fine-grained evaluation framework for reasoning processes and provides insights into how different training methods affect reasoning and knowledge use in medical and mathematical domains.
Findings
R1-distilled models' reasoning does not transfer well to the medical domain.
Supervised fine-tuning improves accuracy but reduces reasoning quality.
Reinforcement learning enhances medical reasoning by refining knowledge use.
Abstract
Recent advances in reasoning-enhanced Large Language Models such as OpenAI-o1/3 and DeepSeek-R1 have significantly improved performance on complex tasks. However, the quality and transparency of their internal reasoning processes remain underexplored. This work moves beyond final-answer accuracy and investigates step-by-step reasoning in the medical and mathematical domains by explicitly decomposing the thinking trajectories into two parts: knowledge and reasoning. Specifically, we introduce a fine-grained evaluation framework that judges (1) the correctness of the knowledge used, measured by the Knowledge Index (KI), and (2) the quality of reasoning, measured by Information Gain (InfoGain). Using this framework, we study R1-distilled and base Qwen models trained with supervised fine-tuning (SFT) and/or reinforcement learning (RL) in the medical and math domains. Three intriguing findings…
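The abstract names the two metrics but does not give their formulas, so the following is only a minimal sketch of how step-level scores of this kind could be aggregated. Everything in it is an assumption made for illustration: the `Step` container, the reading of KI as the fraction of knowledge-bearing steps judged correct, and the reading of InfoGain as the mean per-step increase in the log-probability of the reference answer. The paper's actual definitions may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """Hypothetical record for one step of a thinking trajectory."""
    # True/False if the step states checkable knowledge, None if it states none.
    knowledge_correct: Optional[bool]
    # Assumed: log-probability of the reference answer given all steps so far.
    answer_logprob: float


def knowledge_index(steps: List[Step]) -> float:
    """Fraction of knowledge-bearing steps judged correct (assumed definition)."""
    judged = [s.knowledge_correct for s in steps if s.knowledge_correct is not None]
    if not judged:
        return 0.0
    return sum(judged) / len(judged)


def info_gain(steps: List[Step], prior_logprob: float) -> float:
    """Mean per-step increase in answer log-probability (assumed definition)."""
    if not steps:
        return 0.0
    gains = []
    previous = prior_logprob  # log-probability before any reasoning step
    for step in steps:
        gains.append(step.answer_logprob - previous)
        previous = step.answer_logprob
    return sum(gains) / len(gains)


# Toy trajectory with made-up numbers, for illustration only.
trajectory = [
    Step(knowledge_correct=True, answer_logprob=-2.0),
    Step(knowledge_correct=False, answer_logprob=-1.0),
    Step(knowledge_correct=None, answer_logprob=-0.5),
]
ki = knowledge_index(trajectory)
gain = info_gain(trajectory, prior_logprob=-3.0)
```

Keeping the two scores separate mirrors the decomposition described in the abstract: a trajectory can make steady progress toward the answer while relying on incorrect knowledge, or the reverse.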
Taxonomy
Topics: ERP Systems Implementation and Impact · Private Equity and Venture Capital · Big Data and Business Intelligence
