ReJump: A Tree-Jump Representation for Analyzing and Improving LLM Reasoning
Yuchen Zeng, Shuibai Zhang, Wonjun Kang, Shutong Wu, Lynnix Zou, Ying Fan, Heeju Kim, Ziqian Lin, Jungtaek Kim, Hyung Il Koo, Dimitris Papailiopoulos, Kangwook Lee

TL;DR
ReJump introduces a tree-based representation of LLM reasoning traces, enabling detailed analysis of reasoning behaviors and strategies to enhance reasoning quality through targeted interventions.
Contribution
This work presents ReJump, a novel tree-jump representation for analyzing and improving LLM reasoning, providing new metrics and methods for understanding reasoning dynamics.
Findings
Models with similar accuracy show different reasoning behaviors.
Different tasks favor different reasoning styles.
ReJump-guided strategies can improve reasoning quality.
Abstract
Large Reasoning Models (LRMs) are Large Language Models (LLMs) explicitly trained to generate long-form Chain-of-Thoughts (CoTs), achieving impressive success on challenging tasks like math and programming. However, their underlying reasoning "algorithms" remain poorly understood. To investigate this, we propose ReJump, which represents a reasoning trace as a visitation order over nodes in a tree of intermediate problem-solving steps. Transitions between nodes, which we term jumps, include adjacent moves that capture behaviors such as calculation, and non-adjacent moves that capture behaviors such as backtracking and verification. ReJump enables analyzing LLM reasoning with diverse metrics that quantify exploration, exploitation, overthinking, forgetting, and verification. Using our proposed LLM agent to extract reasoning traces into ReJump format, we evaluate state-of-the-art LRMs on…
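The abstract's core idea, a reasoning trace as a visitation order over tree nodes with adjacent and non-adjacent jumps, can be illustrated with a minimal sketch. This is not the authors' implementation: the parent-pointer encoding, the names `path_to_root`, `tree_distance`, and `classify_jumps`, and the rule that a jump is "adjacent" exactly when it crosses one tree edge are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy tree of intermediate problem-solving steps.
# Each node maps to its parent; the root maps to None.
PARENT: Dict[int, Optional[int]] = {0: None, 1: 0, 2: 0, 3: 1, 4: 2}

# Hypothetical visitation order: the sequence of nodes a trace touches.
VISITS: List[int] = [0, 1, 3, 2, 4, 0]


def path_to_root(parent: Dict[int, Optional[int]], node: int) -> List[int]:
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def tree_distance(parent: Dict[int, Optional[int]], a: int, b: int) -> int:
    """Number of edges on the unique tree path between nodes a and b."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(parent, a))}
    for steps_from_b, n in enumerate(path_to_root(parent, b)):
        if n in steps_from_a:  # first shared ancestor is the lowest one
            return steps_from_a[n] + steps_from_b
    raise ValueError("nodes are not in the same tree")


def classify_jumps(
    parent: Dict[int, Optional[int]], visits: List[int]
) -> List[Tuple[int, int, int, str]]:
    """Label each consecutive transition in the visitation order.

    Assumption: a jump is 'adjacent' when it crosses exactly one edge
    (parent to child or child to parent) and 'non-adjacent' otherwise.
    """
    jumps = []
    for src, dst in zip(visits, visits[1:]):
        dist = tree_distance(parent, src, dst)
        kind = "adjacent" if dist == 1 else "non-adjacent"
        jumps.append((src, dst, dist, kind))
    return jumps


if __name__ == "__main__":
    for src, dst, dist, kind in classify_jumps(PARENT, VISITS):
        print(f"{src} -> {dst}: distance {dist} ({kind})")
```

In this toy trace, the move from node 3 to node 2 leaves one branch for another, so it is labeled non-adjacent, which is the kind of transition the abstract associates with behaviors such as backtracking and verification.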
Peer Reviews
Decision·Submitted to ICLR 2026
- S1. The graph representation methodology is interesting and the proposed metrics based on this graph representation are intuitive and reasonable.
- S2. The graph representations derived by Gemini 2.5 Pro are verified by human evaluation, and the correlation with human judgement is reasonable.
- S3. Comparison on five main model families and various model variants.
- S4. The results quantitatively explain existing insights on *general* reasoning behaviors and show novel *task-specific* insights.
- W1. The method is complex and costly, given that it is task specific. Applying this analysis to a novel task requires (1) a set of samples for the given task (the authors mention that they use 70 samples from each task), (2) funds for API costs (the authors mention they used approximately $2000 across all experiments), and (3) hand-crafting new prompts to adapt the methodology according to the task (mentioned in the limitations).
- Regarding the findings on *general* reasoning patterns of reas
1. The approach is both methodologically novel and conceptually comprehensive. The paper introduces ReJump, a dual-layer representation (tree + jump) that models reasoning both structurally and dynamically, offering a novel and interpretable lens on LLM reasoning. Alongside this, the authors define six behavioral metrics (e.g., jump distance, verification rate, overthinking rate) and two similarity metrics that together form a comprehensive toolkit for quantitatively analyzing exploration–exploi
1. ReJump extraction requires prompting a large LLM (e.g., Gemini 2.5 Pro), leading to high cost and poor scalability for large-scale or real-time analysis.
2. Experiments focus narrowly on mathematical reasoning and arithmetic-style problems; results on commonsense, coding, or multi-hop reasoning would strengthen generality.
3. Task-specific prompt engineering is still needed to define what a “partial solution node” is; automatic adaptation to new domains (e.g., logic, coding) is not yet solv
This work proposes a framework for converting LLM-generated reasoning traces. This is an interesting direction for refining the understanding of LLM reasoning. Defining two types of similarity measurement, covering both the content semiotics and the reasoning jump patterns, is an interesting design. A series of experimental studies was conducted using the proposed framework, and the authors share insights and findings from the work.
In the tree similarity (Sim_T) definition, the authors introduce tree edit distance, a variant of graph edit distance. How to handle the difference in the nodes, say, corresponding to the newly def
The authors should elaborate on how to assess the set of metrics proposed for the evaluation. All the proposed metrics look fine, but how can the authors justify that they do not overlap and that they jointly cover all key aspects of the evaluation? It is unclear if the metrics proposed, or the evaluation approach overall
Taxonomy
Topics: Machine Learning in Materials Science · Topic Modeling · Multimodal Machine Learning Applications
