When Reasoning Meets Its Laws

Junyu Zhang; Yifan Sun; Tianang Leng; Jingyan Shen; Liu Ziyin; Paul Pu Liang; Huan Zhang

arXiv:2512.17901·cs.AI·December 22, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces the Laws of Reasoning (LoRe), a theoretical framework for understanding and improving reasoning behaviors in large reasoning models through properties like monotonicity and compositionality.

Contribution

It formalizes reasoning behaviors with LoRe, proposes a benchmark LoRe-Bench for property measurement, and develops a finetuning method to enhance model reasoning by enforcing the compute law.

Findings

01 · Models generally show monotonic reasoning but lack compositionality.

02 · Enforcing the compute law improves reasoning performance.

03 · Better compliance with LoRe laws leads to consistent reasoning improvements.

Abstract

Despite the superior performance of Large Reasoning Models (LRMs), their reasoning behaviors are often counterintuitive, leading to suboptimal reasoning capabilities. To theoretically formalize the desired reasoning behaviors, this paper presents the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in LRMs. We first propose compute law with the hypothesis that the reasoning compute should scale linearly with question complexity. Beyond compute, we extend LoRe with a supplementary accuracy law. Since the question complexity is difficult to quantify in practice, we examine these hypotheses by two properties of the laws, monotonicity and compositionality. We therefore introduce LoRe-Bench, a benchmark that systematically measures these two tractable properties for large reasoning models. Evaluation shows that most reasoning models exhibit…
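The two properties named in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: reasoning compute is measured as a count of generated tokens, monotonicity is scored with a Spearman rank correlation between a complexity proxy (number of steps) and compute, and compositionality is scored as the relative gap between the compute spent on a composite of two independent questions and the sum of the compute spent on each. The function names `spearman` and `compositionality_gap` and all numbers are hypothetical; this is not LoRe-Bench's actual metric or data.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def compositionality_gap(compute_1, compute_2, compute_composite):
    """Relative deviation of composite compute from the additive prediction.

    Assumption: under a linear compute law, two independent questions asked
    together should cost about compute_1 + compute_2. A gap of 0 means the
    additive prediction holds exactly.
    """
    expected = compute_1 + compute_2
    return abs(compute_composite - expected) / expected


# Illustrative numbers only (not results from the paper).
steps = [1, 2, 3, 4, 5]                # hypothetical complexity proxy
tokens = [120, 250, 390, 610, 800]     # hypothetical reasoning tokens

print(spearman(steps, tokens))              # 1.0: compute rises with complexity
print(compositionality_gap(400, 600, 700))  # 0.3: composite uses 30% less than the sum
```

A rank correlation is used here because it only asks whether compute increases with complexity, which matches the abstract's point that complexity itself is hard to quantify; any other monotonicity score could be swapped in.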

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

This paper investigates a practically important problem: how the compute and accuracy of LLMs vary with question complexity. In addition, the authors not only evaluate existing methods but also propose a novel finetuning method to improve existing models. The presentation is clear, and the figures and tables are well organized and visually appealing.

Weaknesses

The complexity construction of the data is not fully consistent with the definition of complexity in the method section. Turing machine steps seem a more reasonable way to define complexity, whereas the number of steps used in the datasets requires more justification.

Reviewer 02 · Rating 2 · Confidence 3

Strengths

a. This paper is well-written and easy to follow.
b. This paper proposes a theoretical framework focusing on the relationship between problem complexity and both accuracy and computation. The authors find that current models exhibit reasonable monotonicity but lack compositionality.

Weaknesses

a. The evaluation is restricted to models with fewer than 10 billion parameters. It is a critical question whether significantly larger models inherently follow the proposed compositionality law due to emergent capabilities, or whether SFT-Compo remains necessary for them.
b. LORE-COMPO (two totally independent questions followed by an arbitrary combination) is rare in natural language. It contradicts the example in Figure 1. The compositional question is that square the answer of the first ques

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- This paper proposes LoRe, which characterizes reasoning laws through monotonicity and compositionality, enriching the evaluation dimensions.
- This paper introduces the SFT-Compo method, which is simple yet effective, significantly improving models’ reasoning compositionality and overall performance.

Weaknesses

1. The experiments in the paper are mainly conducted on closed-source small-scale models, and the current results may be insufficient to demonstrate their applicability to larger-scale or different types of models.
2. In SFT-Compo, the training triplets appear to be generated solely using DeepSeek-R1-Distill-Qwen-14B. Such reliance on a single teacher model may limit data diversity and weaken the robustness of the conclusions. Have the authors considered incorporating multiple teacher models or

Taxonomy

Topics: Constraint Satisfaction and Optimization · Explainable Artificial Intelligence (XAI) · AI-based Problem Solving and Planning