Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange

Ankit Satpute; Noah Giessing; Andre Greiner-Petter; Moritz Schubotz; Olaf Teschke; Akiko Aizawa; Bela Gipp

arXiv:2404.00344·cs.CL·April 2, 2024·3 cites

Open Access · 1 Repo

TL;DR

This study evaluates how well large language models, GPT-4 in particular, answer 78 questions from Math Stack Exchange. GPT-4 scores highest (nDCG of 0.48, P@10 of 0.37), and a manual case analysis examines the quality and accuracy of its answers.

Contribution

The paper introduces a systematic evaluation of LLMs on real-world math questions and provides detailed analysis of GPT-4's performance and shortcomings in mathematical reasoning.

Findings

01. GPT-4 performs best among the evaluated LLMs, with an nDCG of 0.48

02. GPT-4 achieves a P@10 of 0.37 on Math Stack Exchange questions

03. Limitations in GPT-4's accuracy highlight challenges in complex mathematical reasoning

Abstract

Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performances that surpass those of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this study, we adopted a two-step approach for investigating the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answer benchmarks, to generate answers to 78 questions from the Math Stack Exchange (MSE). Second, a case analysis is conducted on the LLM that showed the highest performance, focusing on the quality and accuracy of its answers through manual evaluation. We found that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) amongst existing LLMs fine-tuned…
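The two metrics quoted in the abstract, P@10 and nDCG, can be illustrated with a minimal sketch. It uses the textbook definitions only; the graded relevance scale (0 to 3), the relevance threshold, and the function names are assumptions made for illustration, and the paper's exact evaluation protocol may differ. In particular, the ideal ranking here is built from the same list of grades, whereas a full evaluation would normally use all judged answers for the question.

```python
import math
from typing import Sequence


def precision_at_k(grades: Sequence[int], k: int = 10, threshold: int = 1) -> float:
    """Fraction of the top-k ranked answers whose grade is at least `threshold`.

    `grades` lists relevance grades in ranked order (assumed scale: 0 to 3).
    """
    hits = sum(1 for g in grades[:k] if g >= threshold)
    return hits / k


def dcg(grades: Sequence[int]) -> float:
    """Discounted cumulative gain: each grade is discounted by log2(rank + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))


def ndcg(grades: Sequence[int]) -> float:
    """DCG normalized by the DCG of the same grades sorted best-first.

    Simplification (assumption): the ideal ranking is derived from the
    given list rather than from all judged answers for the question.
    """
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0


# Hypothetical grades for ten ranked answers to one question.
example_grades = [3, 0, 2, 0, 1, 0, 0, 0, 0, 0]
p10 = precision_at_k(example_grades, k=10)  # 3 relevant answers out of 10
score = ndcg(example_grades)
```

Averaging these per-question values over all questions gives corpus-level scores of the kind reported in the abstract.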

Code & Models

Repositories

gipplab/llm-investig-mathstackexchange (PyTorch, official)

Taxonomy

Topics: Mathematics, Computing, and Information Processing

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Dense Connections · Label Smoothing