How well do Large Language Models perform in Arithmetic tasks?
Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang

TL;DR
This paper introduces MATH 401, a dataset designed to evaluate the arithmetic capabilities of large language models like GPT-4 and LLaMA, highlighting their strengths and weaknesses in solving math problems.
Contribution
The paper presents MATH 401, a new dataset built specifically to assess the arithmetic abilities of large language models, which, to the authors' knowledge, no prior work had evaluated directly.
Findings
Arithmetic accuracy varies across the evaluated models (GPT-4, ChatGPT, InstructGPT, Galactica, and LLaMA).
Evaluation on various arithmetic expressions reveals where each model's arithmetic ability is strong and where it is weak.
MATH 401 enables standardized benchmarking of LLM arithmetic skills.
Abstract
Large language models have shown emergent abilities, including chain-of-thought reasoning, to answer math word problems step by step. Solving math word problems requires not only the ability to decompose problems via chain-of-thought but also the ability to calculate the arithmetic expression in each step correctly. To the best of our knowledge, no prior work focuses on evaluating the arithmetic ability of large language models. In this work, we propose an arithmetic dataset, MATH 401, to test the latest large language models, including GPT-4, ChatGPT, InstructGPT, Galactica, and LLaMA, with various arithmetic expressions, and we provide a detailed analysis of the arithmetic ability of large language models. MATH 401 and the evaluation code are released at https://github.com/GanjinZero/math401-llm.
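To make the evaluation idea in the abstract concrete, here is a minimal sketch of how a model's free-text answer to an arithmetic expression might be scored. It is not the authors' released evaluation code: the function names (`extract_last_number`, `is_correct`, `accuracy`), the "take the last number in the response" heuristic, and the relative tolerance are all assumptions made for illustration.

```python
import math
import re

# Matches integers, decimals, and simple scientific notation (e.g. -3, 2.5, 1e6).
NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def extract_last_number(response):
    """Return the last number found in a model response, or None if there is none.

    Assumption: the final number in the text is the model's answer.
    Thousands separators (commas) are stripped before matching.
    """
    matches = NUMBER.findall(response.replace(",", ""))
    return float(matches[-1]) if matches else None


def is_correct(response, expected, rel_tol=1e-3):
    """Compare the extracted answer with the expected value within a tolerance."""
    predicted = extract_last_number(response)
    if predicted is None:
        return False  # responses with no number count as wrong
    return math.isclose(predicted, expected, rel_tol=rel_tol, abs_tol=1e-9)


def accuracy(pairs):
    """Fraction of (response, expected) pairs answered correctly."""
    if not pairs:
        return 0.0
    return sum(is_correct(r, e) for r, e in pairs) / len(pairs)
```

For example, `accuracy([("3 + 4 = 7", 7), ("2 * 3 = 5", 6), ("no idea", 1)])` counts only the first response as correct. A real harness would also need a prompt template and a model client, both of which are outside this sketch.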
Taxonomy
Topics: Topic Modeling
Methods: Attention Is All You Need · Test · Adam · Layer Normalization · Linear Layer · Label Smoothing · Dropout · Byte Pair Encoding · Multi-Head Attention · Position-Wise Feed-Forward Layer
