Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange
Ankit Satpute, Noah Giessing, Andre Greiner-Petter, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, Bela Gipp

TL;DR
This study evaluates the mathematical problem-solving abilities of large language models, especially GPT-4, on Math Stack Exchange questions, revealing strengths and limitations in their accuracy and reasoning capabilities.
Contribution
The paper introduces a systematic evaluation of LLMs on real-world math questions and provides detailed analysis of GPT-4's performance and shortcomings in mathematical reasoning.
Findings
GPT-4 performs best among the evaluated LLMs on the 78 Math Stack Exchange questions
GPT-4 achieves an nDCG of 0.48 and a P@10 of 0.37 on those questions
Limitations in GPT-4's accuracy highlight challenges in complex mathematical reasoning
Abstract
Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performances that surpass those of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this study, we adopted a two-step approach for investigating the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answer benchmarks, to generate answers to 78 questions from the Math Stack Exchange (MSE). Second, a case analysis is conducted on the LLM that showed the highest performance, focusing on the quality and accuracy of its answers through manual evaluation. We found that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) amongst existing LLMs fine-tuned…
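The two scores quoted in the abstract, P@10 and nDCG, are standard ranking metrics. The sketch below shows how they are commonly computed from graded relevance judgments. It is an illustration under assumptions, not the paper's evaluation code: the function names, the relevance threshold of 1, the use of linear gain, and the example grades are all hypothetical, and the ideal ranking here is built only from the grades passed in, whereas a full evaluation would normally use every judged answer for the question.

```python
import math


def precision_at_k(relevances, k=10, threshold=1):
    """Fraction of the top-k ranked answers whose grade is >= threshold.

    `relevances` holds graded relevance judgments in ranked order.
    The threshold of 1 is an assumption made for this sketch.
    """
    top_k = relevances[:k]
    return sum(1 for grade in top_k if grade >= threshold) / k


def dcg(relevances, k):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(grade / math.log2(rank + 2)
               for rank, grade in enumerate(relevances[:k]))


def ndcg(relevances, k=None):
    """DCG normalized by the DCG of the ideal (descending) ordering.

    Simplification: the ideal ordering is built only from the grades
    passed in, not from all judged answers for the question.
    """
    k = len(relevances) if k is None else k
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical grades (0 = not relevant, 3 = highly relevant) for ten ranked answers.
example_grades = [3, 0, 2, 0, 1, 0, 0, 0, 0, 0]
print(precision_at_k(example_grades, k=10))
print(ndcg(example_grades, k=10))
```

A system-level P@10 or nDCG is usually the mean of these per-question values over all evaluated questions.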
Taxonomy
Topics: Mathematics, Computing, and Information Processing
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Dense Connections · Label Smoothing
