ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

Zhibin Gou; Zhihong Shao; Yeyun Gong; Yelong Shen; Yujiu Yang; Minlie Huang; Nan Duan; Weizhu Chen

arXiv:2309.17452 · cs.CL · February 22, 2024 · 21 cites

TL;DR

ToRA introduces a tool-integrated reasoning framework that enhances mathematical problem-solving by combining language models with external computational tools, achieving state-of-the-art results on multiple datasets.

Contribution

The paper presents ToRA, a novel tool-integrated reasoning agent that significantly improves mathematical reasoning performance by training on interactive tool-use trajectories and output space shaping.

Findings

01. ToRA models outperform open-source models on 10 mathematical reasoning datasets across all scales, with 13%-19% absolute improvements on average.

02. ToRA-7B achieves 44.6% on MATH, surpassing WizardMath-70B.

03. ToRA-Code-34B exceeds 50% accuracy on MATH, outperforming GPT-4 CoT.

Abstract

Large language models have made significant progress in various language tasks, yet they still struggle with complex mathematics. In this paper, we propose ToRA, a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical problems by seamlessly integrating natural language reasoning with the use of external tools (e.g., computation libraries and symbolic solvers), thereby combining the analytical prowess of language with the computational efficiency of tools. To train ToRA, we curate interactive tool-use trajectories on mathematical datasets, apply imitation learning on the annotations, and propose output space shaping to further refine models' reasoning behavior. As a result, ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets across all scales with 13%-19% absolute improvements on average. Notably, ToRA-7B…
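
The abstract's central idea, interleaving natural-language reasoning with calls to an external tool, can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration and is not taken from the paper or the microsoft/tora repository: the <program> and <output> delimiters, the solve and run_program helpers, and scripted_model, a hard-coded stand-in for a language model. The sketch also executes code with a bare exec and no sandbox or timeout, which a real system would need.

```python
import contextlib
import io
import re
from typing import Callable

# Hypothetical delimiters marking a program inside a model step.
PROGRAM_RE = re.compile(r"<program>(.*?)</program>", re.DOTALL)


def run_program(code: str) -> str:
    """Execute a Python snippet and return what it printed (the tool output)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # illustration only: no sandboxing, no timeout
    return buffer.getvalue().strip()


def solve(question: str, generate: Callable[[str], str], max_rounds: int = 3) -> str:
    """Alternate model steps and tool execution until a step has no program."""
    trajectory = question
    for _ in range(max_rounds):
        step = generate(trajectory)
        trajectory += "\n" + step
        match = PROGRAM_RE.search(step)
        if match is None:
            break  # no program in this step: treat it as the final answer
        output = run_program(match.group(1))
        # Feed the tool output back so the next step can reason over it.
        trajectory += "\n<output>" + output + "</output>"
    return trajectory


def scripted_model(trajectory: str) -> str:
    """Hard-coded stand-in for a language model (assumption, not ToRA)."""
    if "<output>" not in trajectory:
        return (
            "Sum the integers from 1 to 100 with a short program.\n"
            "<program>print(sum(range(1, 101)))</program>"
        )
    result = trajectory.rsplit("<output>", 1)[1].split("</output>", 1)[0]
    return "The program printed " + result + ", so the answer is " + result + "."


trajectory = solve("What is 1 + 2 + ... + 100?", scripted_model)
print(trajectory)
```

Running it yields a trajectory in which the rationale, the program, its output, and the final answer appear in order. A real agent would replace scripted_model with a call to a language model.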

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 5

Strengths

The idea is clear, well-constructed, and well-explained. The figures are excellent and the algorithm is clearly laid out. The resulting models show considerable performance increases under a range of evaluation settings confirming the efficacy of the strategy.

Weaknesses

While the authors have presented what worked well, there is a considerable amount to be gleaned from the failure modes. The authors loosely allude to failure cases including geometric problems and program timeouts, and provide single examples in the appendix, but there are surely more interesting patterns. It would be wonderful if the authors could provide more specific examples and comment on more systematic classes of errors beyond these simple categorizations. For example, are there certain p

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 3

Strengths

1. This paper proposes a two-stage training framework that utilizes training data alternating between natural language and code language to enhance the reasoning ability of language models in mathematical reasoning tasks. The experimental results demonstrate the significant improvement of this approach across 10 datasets.
2. The paper is generally well-written and the figures and tables presented are clear and easy to understand.

Weaknesses

1. From Figure 5, it can be observed that the performance of the model does not significantly decrease when output space shaping is removed. More experiments are needed to demonstrate whether the performance improvement in this stage is due to this training strategy rather than additional data and more training epochs.
2. Regarding the ToRA-Corpus proposed in this paper, more detailed information is needed regarding the data construction process, quality evaluation, and dataset statistics.

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 5

Strengths

- The paper is easy to follow
- ToRA achieves good performance on math datasets

Weaknesses

- **Limited technical novelty**: Using imitation learning to improve the mathematical reasoning ability of open-source models has been proposed in many recent works, e.g.,
  - Scaling relationship on learning mathematical reasoning with large language models, https://arxiv.org/abs/2308.01825
  - WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct, https://arxiv.org/abs/2308.09583
  - MetaMath: Bootstrap Your Own Mathematical Question

Code & Models

Repositories

microsoft/tora (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Educational Games and Gamification · Artificial Intelligence in Games

Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Layer · Label Smoothing · Absolute Position Encodings · Adam · Residual Connection · Layer Normalization · Softmax