ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Jiazhan Feng; Shijue Huang; Xingwei Qu; Ge Zhang; Yujia Qin; Baoquan Zhong; Chengquan Jiang; Jinxin Chi; Wanjun Zhong

arXiv:2504.11536·cs.CL·April 18, 2025·2 cites

TL;DR

ReTool introduces a reinforcement learning framework that enables large language models to dynamically and autonomously learn when and how to use computational tools like code interpreters, significantly improving complex reasoning tasks such as math problem solving.

Contribution

The paper presents a novel RL-based training paradigm that systematically teaches LLMs to effectively incorporate tools during reasoning without human priors, enhancing their problem-solving capabilities.

Findings

1. ReTool reaches 67.0% accuracy on AIME 2024 with only 400 training steps.
2. ReTool outperforms baseline models in both training efficiency and accuracy.
3. Emergent behaviors include code self-correction and adaptive tool use.

Abstract

While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning, including two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented…
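
The interleaving of generation and code execution described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `<code>` and `<interpreter>` delimiters, the `rollout`, `run_python`, and `outcome_reward` helpers, the +1/-1 reward values, and the `toy_policy` stand-in for a language model. The `exec` call is not a sandbox and is only suitable for a toy example.

```python
import contextlib
import io
import re

# Hypothetical delimiters; the real tags used in training may differ.
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_python(snippet: str) -> str:
    """Run a snippet and return its stdout, or the error text on failure.

    NOTE: plain exec() is NOT a sandbox; a real system would isolate this.
    """
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(snippet, {})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buf.getvalue().strip()


def rollout(generate, prompt: str, max_turns: int = 4) -> str:
    """Alternate between text generation and code execution."""
    context = prompt
    for _ in range(max_turns):
        segment = generate(context)
        context += segment
        match = CODE_RE.search(segment)
        if match is None:
            break  # no tool call in this segment: treat it as the final answer
        feedback = run_python(match.group(1))
        context += f"<interpreter>{feedback}</interpreter>"
    return context


def outcome_reward(trajectory: str, gold: str) -> float:
    """Reward depends only on the final boxed answer (assumed +1 / -1)."""
    found = re.search(r"\\boxed\{(.*?)\}", trajectory)
    return 1.0 if found and found.group(1).strip() == gold else -1.0


def toy_policy(context: str) -> str:
    """Stand-in for an LLM: call the tool once, then report its output."""
    if "<interpreter>" not in context:
        return "Let me compute this. <code>print(sum(range(1, 101)))</code>"
    result = re.findall(r"<interpreter>(.*?)</interpreter>", context, re.DOTALL)[-1]
    return f" The answer is \\boxed{{{result}}}."


trajectory = rollout(toy_policy, "What is 1 + 2 + ... + 100? ")
reward = outcome_reward(trajectory, "5050")
```

In this sketch the policy decides for itself whether to emit a code block, and only the final answer is scored, which mirrors the outcome-feedback setup the abstract describes.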

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The paper gets decent results compared to the baselines, and the ablations seem to show that the LLM can use tools in a way that raises its chance of success.
- The statistics on how often responses use the Python code execution tool, and on how long or correct the code is, are interesting, although calling this cognitive analysis seems a bit much.

Weaknesses

The primary contribution is an RL framework that integrates reasoning and tool use by adding delimiter tokens around the tool-use part and having a parser check whether to hand over to a Python interpreter before continuing with token generation. The reward is the final correctness, which is the same as in the standard RL reasoning paradigm. This is a straightforward approach that has been considered by many people and to distinguish this paper from a "flag-planting" paper, I'd suggest a

Reviewer 02 · Rating 8 · Confidence 5

Strengths

1. The proposed two-stage approach, combining supervised learning for foundational skills with reinforcement learning for strategic optimization, is logical, well-motivated, and shown to be highly effective.
2. ReTool achieves state-of-the-art performance on the challenging AIME benchmarks, substantially outperforming both its own 32B backbone and other strong, often larger, models. The reported efficiency (e.g., achieving high scores with only 400 training steps) is particularly impressive and

Weaknesses

1. The performance gain from the RL stage is substantial (e.g., from 40.9% to 67.0% on AIME2024 in the ablation). Based on your cognitive analysis, could you provide more intuition on what you believe is the most critical strategic capability the model learns during RL that SFT on curated data fails to instill? Is it primarily about *when* to invoke the tool, or does it also learn more complex policies like using the tool for iterative verification or hypothesis testing?

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. It provides a practical, reproducible pipeline that many groups could adopt.
2. Competitive results on challenging benchmarks.
3. The paper spells out the execution protocol, loss masking, async sandboxing, and caching, practical details that substantially reduce adoption friction.
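
The loss masking mentioned in the last point can be sketched as follows. This is a hedged illustration, not the paper's implementation: it assumes interpreter feedback is wrapped in hypothetical `<interpreter>` / `</interpreter>` tokens, that the tags themselves are masked along with their contents, and that tokens are plain strings. The idea is that tokens produced by the tool, not the policy, are excluded from the training loss.

```python
def loss_mask(tokens, open_tag="<interpreter>", close_tag="</interpreter>"):
    """Return 1 for model-generated tokens and 0 for interpreter feedback.

    Assumption: the delimiter tokens are masked together with their contents.
    """
    mask = []
    inside = False
    for tok in tokens:
        if tok == open_tag:
            inside = True
        mask.append(0 if inside else 1)
        if tok == close_tag:
            inside = False
    return mask


def masked_mean(values, mask):
    """Average per-token values over unmasked positions only."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


tokens = ["x", "<code>", "print(1)", "</code>",
          "<interpreter>", "1", "</interpreter>", "done"]
mask = loss_mask(tokens)

# Hypothetical per-token losses; the large values sit on interpreter tokens.
per_token_loss = [1.0, 1.0, 1.0, 1.0, 9.0, 9.0, 9.0, 1.0]
mean_loss = masked_mean(per_token_loss, mask)
```

Because the three interpreter positions are dropped, the averaged loss reflects only the tokens the policy actually chose.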

Weaknesses

1. It has limited algorithmic novelty. The optimization relies on standard PPO.
2. Impact of RL on reasoning upper-bound (pass@k) is missing. I strongly encourage the authors to provide pass@k (k=32,64,128,256,1024) results in the rebuttal phase.
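
For readers unfamiliar with the pass@k metric requested above, a common way to compute it is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The sketch below assumes that estimator; the review does not say which variant the reviewer has in mind, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c out of n drawn samples were correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 3 correct samples out of 10, pass@1 equals the plain accuracy 3/10.
example = pass_at_k(n=10, c=3, k=1)
```

Evaluating at large k, such as the k=1024 the reviewer asks for, requires drawing at least that many samples per problem.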

Code & Models

Repositories

volcengine/verl
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Games

Methods: Balanced Selection