MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs

Zimu Lu; Aojun Zhou; Houxing Ren; Ke Wang; Weikang Shi; Junting Pan; Mingjie Zhan; Hongsheng Li

arXiv:2402.16352·cs.CL·September 12, 2024·2 cites

TL;DR

MathGenie introduces a novel data augmentation method using question back-translation to improve the mathematical reasoning capabilities of large language models, achieving state-of-the-art results among open-source models.

Contribution

The paper presents MathGenie, a new approach for generating diverse math problems from limited data, enhancing open-source LLMs' reasoning performance with a back-translation technique.

Findings

1. MathGenie models outperform previous open-source models on five datasets.

2. MathGenieLM-InternLM2 achieves 87.7% on GSM8K and 55.7% on MATH.

3. The augmentation technique significantly improves reasoning accuracy.

Abstract

Large language models (LLMs) have exhibited great potential in mathematical reasoning. However, there remains a performance gap in this area between existing open-source models and closed-source models such as GPT-4. In this paper, we introduce MathGenie, a novel method for generating diverse and reliable math problems from a small-scale problem-solution dataset (denoted as seed data). We augment the ground-truth solutions of our seed data and train a back-translation model to translate the augmented solutions back into new questions. Subsequently, we generate code-integrated solutions for the new questions. To ensure the correctness of the code-integrated solutions, we employ a rationale-based strategy for solution verification. Various pretrained models, ranging from 7B to 70B, are trained on the newly curated data to test the effectiveness of the proposed augmentation technique,…
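
The data flow described in the abstract (augment seed solutions, back-translate them into questions, generate code-integrated solutions, then verify) can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the `Example` type, the function name `mathgenie_style_pipeline`, and the four callables passed into it are all hypothetical placeholders standing in for the trained models the paper uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Example:
    """A question paired with a solution (hypothetical container type)."""
    question: str
    solution: str


def mathgenie_style_pipeline(
    seed: Iterable[Example],
    augment_solution: Callable[[str], List[str]],  # assumed: solution -> augmented variants
    back_translate: Callable[[str], str],          # assumed: augmented solution -> new question
    solve_with_code: Callable[[str], str],         # assumed: question -> code-integrated solution
    verify: Callable[[str, str], bool],            # assumed: (question, solution) -> accept or reject
) -> List[Example]:
    """Build new question-solution pairs from seed data, keeping only verified ones."""
    new_data: List[Example] = []
    for item in seed:
        # Step 1: augment the ground-truth solution of each seed example.
        for augmented in augment_solution(item.solution):
            # Step 2: translate the augmented solution back into a new question.
            question = back_translate(augmented)
            # Step 3: generate a code-integrated solution for the new question.
            candidate = solve_with_code(question)
            # Step 4: keep the pair only if verification accepts it.
            if verify(question, candidate):
                new_data.append(Example(question, candidate))
    return new_data
```

In practice each callable would wrap a model call; passing them as parameters keeps the sketch independent of any particular model or library.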

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Natural Language Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning

Methods: Linear Layer · Dropout · Layer Normalization · Byte Pair Encoding · Multi-Head Attention · Dense Connections · Label Smoothing · Adam · Attention Is All You Need · Softmax