Chain of Execution Supervision Promotes General Reasoning in Large Language Models
Nuo Chen, Zehua Li, Keqin Bao, Junyang Lin, Dayiheng Liu

TL;DR
This paper introduces TracePile, a large dataset transforming code execution into explicit reasoning steps, which improves the reasoning abilities of large language models across multiple domains and benchmarks.
Contribution
The paper presents TracePile, a novel large-scale corpus that converts code execution into explicit rationales, enhancing LLM reasoning through structured step-by-step explanations.
Findings
TracePile improves LLM performance on math and code benchmarks.
Two-stage fine-tuning with TracePile yields significant accuracy gains.
Explicit reasoning via code execution enhances model robustness and generalization.
Abstract
Building robust and general reasoning ability is a central goal in the development of large language models (LLMs). Recent efforts increasingly turn to code as a rich training source, given its inherent logical structure and diverse reasoning paradigms such as divide-and-conquer, topological ordering, and enumeration. However, reasoning in code is often expressed implicitly and entangled with syntactic or implementation noise, making direct training on raw code suboptimal. To address this, we introduce TracePile, a large-scale corpus of 2.6 million samples that transforms code execution into explicit, step-by-step chain-of-thought-style rationales, which we call Chain of Execution (CoE). The corpus spans domains including mathematics, classical algorithms and algorithmic competition, and is enriched with variable-tracing questions and code rewritings to enhance logical granularity and…
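To make the idea of turning code execution into an explicit rationale concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the abstract does not say how traces are collected or worded, so the helper names (`record_states`, `render_rationale`), the use of `sys.settrace`, and the "Step n" phrasing are all assumptions made for illustration.

```python
import sys


def gcd(a, b):
    """Euclid's algorithm; the reasoning lives implicitly in the loop."""
    while b != 0:
        a, b = b, a % b
    return a


def record_states(func, *args):
    """Run func(*args) and snapshot its local variables as it executes.

    Hypothetical helper: uses the standard sys.settrace hook and only
    follows the frame that belongs to `func` itself.
    """
    snapshots = []
    target = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not target:
            return None  # do not trace any other function
        if event in ("line", "return"):
            snapshots.append(dict(frame.f_locals))  # copy, not a live view
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous tracer
    return result, snapshots


def render_rationale(snapshots):
    """Turn consecutive snapshots into text steps, one per variable update."""
    steps = []
    for before, after in zip(snapshots, snapshots[1:]):
        changed = sorted(
            name for name, value in after.items()
            if name not in before or before[name] != value
        )
        if changed:
            updates = ", ".join(f"{name} = {after[name]!r}" for name in changed)
            steps.append(f"Step {len(steps) + 1}: {updates}")
    return steps


if __name__ == "__main__":
    result, snapshots = record_states(gcd, 48, 18)
    for step in render_rationale(snapshots):
        print(step)
    print(f"Answer: {result}")
```

Each printed step names the variables that changed and their new values, which is roughly the kind of variable-level trace the abstract describes. A real corpus would presumably pair such traces with natural-language explanations, but that part is not specified in the text above.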
