R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning
Yongchao Chen, Yueying Liu, Junwei Zhou, Yilun Hao, Jingquan Wang, Yang Zhang, Na Li, Chuchu Fan

TL;DR
This paper introduces R1-Code-Interpreter, a multi-stage reinforcement learning approach to train large language models to reason with code across diverse tasks, significantly improving accuracy and demonstrating emergent self-checking behaviors.
Contribution
The paper presents a novel multi-stage curriculum learning method for training a general-purpose code-interpreting LLM, addressing task heterogeneity and sample scarcity challenges.
Findings
Final model achieves 72.4% accuracy on 37 tasks, surpassing GPT-4o.
RL training improves average accuracy from 44.1% to 72.4%.
Emergent self-checking behavior observed in the model.
Abstract
Practical guidance on training Large Language Models (LLMs) to leverage Code Interpreter across diverse tasks remains lacking. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. Unlike prior RL + tool-use efforts focused on narrow domains such as math or retrieval, we curate 144 diverse reasoning and planning tasks and show that training a general-purpose Code Interpreter across them presents significant challenges due to task heterogeneity and scarcity of effective samples. To address this, we introduce a multi-stage curriculum learning approach that partitions training samples by measured improvement potential. The RL training prioritizes samples with higher potential and gradually shifts to lower-potential ones,…
Peer Reviews
Decision·ICLR 2026 Poster
1. This paper extends the task of combining symbolic code generation with textual reasoning to broader benchmarks, providing a thorough investigation of the generalizability of this paradigm. 2. This paper demonstrates innovation by proposing a curriculum learning method based on Improvement Potential, successfully extending TIR training from single-task settings to multi-task scenarios. 3. Compared to traditional curriculum learning approaches, the proposed method is designed based on impro
1. The improvement potential score defined in this paper is modeled as a function symmetric about p=½. In this case, even if two samples have the same improvement potential score, their empirical correctness rates may differ significantly. Therefore, in curriculum learning, simply incorporating samples with low improvement potential scores may overlook the training contribution differences brought by samples with different empirical correctness rates. For example, training samples with low empir
**Comprehensiveness:** The authors provide a broad, carefully controlled experimental program, spanning diverse task families, staged training with curriculum learning, warm-start ablation, strong baselines, and behavioural system measurements (emergent self-checking, code-usage, verbosity), that provides compelling evidence of the paper’s general-purpose claims. **Unlocking Code Interpreter Potential:** This work offers a concrete, scalable recipe that elevates code execution from a math-only
**Single-Language Scope:** The paper aims for a general code interpreter for code generation and reasoning across tasks and domains, but trains and evaluates only with a Python executor; transfer to other languages and runtimes is untested. Reporting results on other languages would materially strengthen the general code-generation claim.
1. Practical, replicable CI protocol: simple python code blocks, a clear final-answer marker, and tight caps on tool use (e.g., max code calls and per-call timeout) that help reproducibility. 2. Tangible engineering win: decoupling code execution into a CPU sandbox reduces RL training wall-clock and avoids GPU stalls. 3. Within-scope performance & diagnostics: decent gains on their own benchmark plus ablations and behavior analysis (e.g., code-based self-checking, typical call counts) that clari
1. Improvement Potential and curriculum learning: This looks novel at first glance, but in RL it is increasingly standard to use data where the model neither gets everything right nor everything wrong as the main RL signal. The paper largely describes this under a Bernoulli correctness assumption (P_i = 4p(1 - p)), which, at least to me, does not amount to a significant new idea or contribution beyond formalizing that intuition. 2. Limited evaluation breadth; same-source testing; augmented SFT
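To make the reviewers' discussion of the improvement potential score concrete, here is a minimal Python sketch. It assumes only the form quoted in the reviews, 4p(1 - p), where p is a sample's empirical correctness rate. The function names, the threshold values, and the bucketing rule are hypothetical illustrations, not the authors' implementation.

```python
from typing import Dict, List


def improvement_potential(p: float) -> float:
    """Score 4p(1 - p): equals 1.0 at p = 0.5 and 0.0 at p = 0 or p = 1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    return 4.0 * p * (1.0 - p)


def partition_into_stages(
    correct_rates: Dict[str, float],
    thresholds: List[float],
) -> List[List[str]]:
    """Bucket samples into curriculum stages by descending potential.

    `thresholds` must be sorted in descending order (hypothetical values).
    Stage 0 holds samples scoring at or above thresholds[0], stage 1 those
    at or above thresholds[1], and the last stage holds everything else.
    """
    stages: List[List[str]] = [[] for _ in range(len(thresholds) + 1)]
    for sample_id, p in correct_rates.items():
        score = improvement_potential(p)
        for i, threshold in enumerate(thresholds):
            if score >= threshold:
                stages[i].append(sample_id)
                break
        else:
            # Below every threshold: lowest-potential stage.
            stages[-1].append(sample_id)
    return stages


if __name__ == "__main__":
    # Hypothetical correctness rates measured from repeated rollouts.
    rates = {"a": 0.5, "b": 0.9, "c": 1.0, "d": 0.1}
    print(partition_into_stages(rates, thresholds=[0.75, 0.25]))
```

In this sketch, samples "b" (p = 0.9) and "d" (p = 0.1) receive the same score and land in the same stage. That is the symmetry about p = ½ that one reviewer raises: the score alone cannot distinguish a mostly-solved sample from a mostly-failed one.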
Taxonomy
Methods: Shrink and Fine-Tune · Entropy Regularization · Proximal Policy Optimization
