AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent
Haipeng Luo, Huawen Feng, Qingfeng Sun, Can Xu, Kai Zheng, Yufei Wang, Tao Yang, Han Hu, Yansong Tang

TL;DR
AgentMath introduces a tool-augmented agent framework that combines language models with code interpreters, improving mathematical reasoning accuracy and efficiency on complex benchmarks through automated SFT data synthesis and agentic reinforcement learning.
Contribution
The paper presents a novel agent framework integrating language models with code interpreters, featuring automated data generation, reinforcement learning for tool use, and efficient training methods, advancing mathematical reasoning capabilities.
Findings
Achieves state-of-the-art accuracy on mathematical benchmarks; reviewers cite 90.6/86.4/73.8% on AIME24/AIME25/HMMT25 for the 30B model.
Demonstrates a 4-5x speedup in RL training throughput.
Surpasses several existing models in complex mathematical tasks.
Abstract
Large Reasoning Models (LRMs) like o3 and DeepSeek-R1 have achieved remarkable progress in reasoning tasks with long chain-of-thought (CoT). However, they remain computationally inefficient and struggle with accuracy when solving problems requiring complex mathematical operations. In this work, we present AgentMath, an agent framework that seamlessly integrates language models' reasoning capabilities with code interpreters' computational precision to efficiently tackle complex mathematical problems. Our approach introduces three key innovations: (1) An automated method that converts natural language chain-of-thought into structured tool-augmented trajectories, generating high-quality supervised fine-tuning (SFT) data to alleviate data scarcity; (2) A novel agentic reinforcement learning (RL) paradigm that dynamically interleaves natural language generation with real-time code execution. This enables…
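The interleaving of natural language generation with code execution described in the abstract can be illustrated with a toy loop. This is a minimal sketch and not the paper's implementation: the `<code>`/`<output>` tags, the function names `run_snippet` and `solve`, the `max_tool_calls` limit, and the scripted stand-in for the language model are all assumptions made for illustration, and `exec` is used here only for brevity (it is not a sandbox).

```python
import contextlib
import io
import re
from typing import Callable

# Hypothetical tag format for tool calls; the paper's actual format may differ.
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_snippet(source: str) -> str:
    """Execute a snippet and return its captured stdout, or the error text."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})  # illustration only: NOT a secure sandbox
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def solve(question: str, generate: Callable[[str], str], max_tool_calls: int = 4) -> str:
    """Alternate between text generation and code execution until the
    model stops emitting code or the tool-call budget is exhausted."""
    context = question
    tool_calls = 0
    while True:
        segment = generate(context)
        context += "\n" + segment
        match = CODE_RE.search(segment)
        if match is None or tool_calls >= max_tool_calls:
            return context
        tool_calls += 1
        # Feed the interpreter result back so the next segment can use it.
        context += "\n<output>" + run_snippet(match.group(1)) + "</output>"


def scripted_model(context: str) -> str:
    """Stand-in for a language model (assumption): one tool call, then an answer."""
    if "<output>" not in context:
        return "I will compute the sum with code.\n<code>print(sum(range(1, 101)))</code>"
    result = re.findall(r"<output>(.*?)</output>", context, re.DOTALL)[-1]
    return f"The answer is {result}."


transcript = solve("What is 1 + 2 + ... + 100?", scripted_model)
print(transcript)
```

In this sketch the interpreter output is appended to the context before the next generation step, so the final answer is grounded in an executed computation rather than in arithmetic carried out in text.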
Peer Reviews
Decision · ICLR 2026 Poster
- The paper is well presented. All components of the pipeline are explained in good detail and with clear intuition, which provides a good recipe for open-source math prover training.
- A very comprehensive comparison between the model trained in this work and other open- and closed-source models is provided, making the results very convincing.
- From my perspective, the paper's contribution lies mostly in the development of the entire pipeline. Many of the technical innovations claimed in the paper are either standard or mostly engineering, and agentic RL training (including coding-assisted math reasoning) is no longer a fresh concept. I value the paper's efforts in developing such a comprehensive pipeline (SFT data collection and large-scale RL system building), which I understand is very challenging and helpful for
1. The refinement pipeline (format consistency, executability checks, feedback alignment, self-correction) shows sizable SFT gains and scaling trends before RL.
2. The async scheduler + partial rollout + prefix-aware balancing reportedly lift RL throughput 4–5×, and the paper provides a breakdown and sensitivity to segment count.
3. The 30B model achieves 90.6/86.4/73.8% on AIME24/25/HMMT25
1. Novelty over prior tool-augmented reasoning may feel incremental. The community already has multiple tool-augmented RL frameworks (e.g., ReTool-style RL teaching strategic tool calls) and mature RL infra with asynchronous rollout and truncation/partial-trajectory techniques (e.g. AREAL, ROLL). Much of AgentMath’s lift appears to stem from more elaborate SFT data synthesis and a careful infra implementation rather than a fundamentally new RL principle. From a novelty lens, the async scheduler
* The paper conducts an important study on integrating the Code Interpreter into LLMs to improve mathematical reasoning via symbolic computation.
* The data synthesis pipeline and training strategies (including the code execution sandbox and adaptive load balancing) are technically sound and thoughtfully designed to improve training efficiency.
* The proposed RL with Code Interpreter integration is shown to be effective through extensive experimental results.
* The experiments are extensive a
* Concern about benchmark: The math benchmarks (AIME24, AIME25, HMMT) each contain only about 30 questions. Prior work has shown that these datasets may have leaked into open-source model training corpora. Thus, it is unclear whether the reported results are entirely reliable. The authors are encouraged to provide more evidence on the independent contribution of the proposed approach.
* Training-testing overlap: The authors report 346k training questions for SFT and 42k for RL, while the total
Taxonomy
Topics: Topic Modeling · Machine Learning in Materials Science · Multimodal Machine Learning Applications
