Common 7B Language Models Already Possess Strong Math Capabilities

Chen Li; Weiqi Wang; Jingcheng Hu; Yixuan Wei; Nanning Zheng; Han Hu; Zheng Zhang; Houwen Peng

arXiv:2403.04706·cs.CL·March 8, 2024·3 cites


TL;DR

This paper demonstrates that the LLaMA-2 7B language model inherently possesses strong mathematical reasoning abilities, which can be significantly improved through simple scaling of training data, including synthetic data, without requiring extensive math-specific pre-training.

Contribution

It reveals that common 7B language models already have strong math skills and shows how scaling up training data, especially synthetic data, enhances their mathematical accuracy.

Findings

01

LLaMA-2 7B achieves 97.7% accuracy on GSM8K when the best of 256 sampled responses is selected.

02

Synthetic data scaling improves math accuracy nearly as much as real data.

03

Scaling training data boosts model performance beyond previous models.

Abstract

Mathematical capabilities were previously believed to emerge in common language models only at a very large scale or require extensive math-related pre-training. This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities, as evidenced by its impressive accuracy of 97.7% and 72.0% on the GSM8K and MATH benchmarks, respectively, when selecting the best response from 256 random generations. The primary issue with the current base model is the difficulty in consistently eliciting its inherent mathematical capabilities. Notably, the accuracy for the first answer drops to 49.5% and 7.9% on the GSM8K and MATH benchmarks, respectively. We find that simply scaling up the SFT data can significantly enhance the reliability of generating correct answers. However, the potential for extensive scaling is constrained by the scarcity of publicly…
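The abstract contrasts two metrics: accuracy of the first answer, and accuracy when the best of many sampled generations is selected. The following is a minimal sketch of how the two differ, assuming each question already has a list of correctness flags for its sampled responses. The function names and toy data are hypothetical and not taken from the paper. The `pass_at_k` helper uses the standard unbiased pass@k estimator, which may or may not match the paper's exact evaluation procedure.

```python
from math import comb
from typing import Sequence


def first_answer_accuracy(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions whose first sampled response is correct."""
    return sum(1 for r in results if r[0]) / len(results)


def best_of_n_accuracy(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions with at least one correct response among all samples."""
    return sum(1 for r in results if any(r)) / len(results)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one question.

    n: number of sampled responses, c: number of correct ones,
    k: number of responses drawn without replacement (k <= n).
    """
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Toy data (hypothetical): 4 questions, 4 sampled responses each.
toy = [
    [True, False, True, False],
    [False, False, True, False],
    [False, False, False, False],
    [False, True, True, True],
]

print(first_answer_accuracy(toy))  # 0.25
print(best_of_n_accuracy(toy))     # 0.75
```

In the toy data, only one question is answered correctly on the first try, while three of the four have at least one correct sample. That is the same kind of gap the abstract reports between first-answer and best-of-256 accuracy.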

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 3

Strengths

The authors uncover an interesting and less-explored phenomenon: by generating a large number of responses (up to 1024) per question, even small-scale language models like LLaMA-2 7B can achieve remarkably high accuracy on mathematical benchmarks. Comprehensive Experimental Evaluation: The authors conduct extensive experiments across multiple benchmarks, including GSM8K, MATH, etc. Well-Structured Presentation: The paper is organized logically, guiding the reader through the motivation, methodol

Weaknesses

Lack of Novelty: Enhancing language models with synthetic data and multiple answer generation has been explored in previous studies, and the way the SFT data is generated may cause overfitting to the evaluation. Practical Limitations: Generating 1024 responses per question to select the correct answer is impractical for real-world applications due to computational inefficiency.

Reviewer 02 · Rating 3 · Confidence 4

Strengths

Scaling Approach: The paper demonstrates that scaling synthetic data can effectively improve performance for smaller models, which challenges the assumption that only large-scale models excel in complex reasoning tasks. High Benchmark Performance: The LLaMA-2 7B model's performance on GSM8K and MATH benchmarks is commendable and suggests that synthetic data can be a viable alternative to real data in specific contexts. Experiment Setup: The paper includes various data scaling scenario

Weaknesses

1. Training Data Overlap Concerns: There’s a significant possibility that LLaMA-2's pretraining data includes or paraphrases the benchmarks used in evaluation, especially given the unknowns around its training corpus. This overlap could mean that the results simply "unlock" capabilities that were already in the model due to exposure during pretraining. Testing on a model where pretraining data is known and does not include the benchmarks would provide a more credible assessment. 2. Limited Benchma

Reviewer 03 · Rating 3 · Confidence 5

Strengths

The paper has interesting empirical findings that may be of interest to researchers & developers on the math performance of small LLMs on large amounts of synthetic distillation data. The paper is generally well written and easy to follow, for the most part. The methods are detailed especially nicely to ensure ease of understanding and reproducibility.

Weaknesses

The main weaknesses are: 1. Lack of significant novelty or surprising results: Prior work has shown that extensive math-related pre-training improves the math capabilities of 7B-scale models. These results, noted even within the first sentence of the abstract, reduce the novelty of this work. It follows that if 7B-scale models have stronger math performance after pretraining on more math-related data, we can similarly distill strong model math performance into smaller models in post-trai

Reviewer 04 · Rating 3 · Confidence 4

Strengths

1- Understanding reasoning capabilities of models, including mathematical capabilities, is an important research direction. 2- The paper is well-written and the arguments are presented well.

Weaknesses

1- The observation that increasing the number of generations improves the probability of obtaining a correct answer is not new to this work. For example, figure 4 from [1] reports the same observation on GSM8K. 2- Most importantly, the use of pass@1024 to claim a model's "capability" is methodologically questionable. Consider this extreme example for clarity: Imagine the target task is finding a path from start to end in a maze. If I write a program that randomly tests paths in the maze, with e


Taxonomy

Topics: Natural Language Processing Techniques · Multi-Agent Systems and Negotiation · Machine Learning and Algorithms

Methods: Shrink and Fine-Tune · Balanced Selection