LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs

Arash Gholami Davoodi; Seyed Pouyan Mousavi Davoudi; Pouya Pezeshkpour

arXiv:2406.05194 · cs.CL · April 1, 2025

TL;DR

This paper introduces the Mathematical Topics Tree (MaTT) benchmark to comprehensively evaluate LLMs' mathematical reasoning across diverse topics, revealing limited accuracy and reasoning capabilities even in advanced models like GPT-4.

Contribution

The paper presents MaTT, a structured, large-scale benchmark for assessing LLMs' mathematical reasoning, highlighting current limitations and discrepancies in model performance.

Findings

01 GPT-4 achieved only 54% accuracy on MaTT in a multiple-choice setting.

02 Chain-of-Thought prompting yielded mostly no notable improvement.

03 Even when a model's answer was correct, only 53.3% of its explanations were complete and correct.

Abstract

Large language models (LLMs) demonstrate impressive capabilities in mathematical reasoning. However, despite these achievements, current evaluations are mostly limited to specific mathematical topics, and it remains unclear whether LLMs are genuinely engaging in reasoning. To address these gaps, we present the Mathematical Topics Tree (MaTT) benchmark, a challenging and structured benchmark that offers 1,958 questions across a wide array of mathematical subjects, each paired with a detailed hierarchical chain of topics. Upon assessing different LLMs using the MaTT benchmark, we find that the most advanced model, GPT-4, achieved a mere 54% accuracy in a multiple-choice scenario. Interestingly, even when employing Chain-of-Thought prompting, we observe mostly no notable improvement. Moreover, LLMs' accuracy dropped dramatically, by up to 24.2 percentage points, when the questions were…
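Because each MaTT question is paired with a hierarchical chain of topics, accuracy can in principle be reported at every level of the tree rather than as a single number. The sketch below illustrates that idea only; it is not the authors' evaluation code. The record layout, the function name `accuracy_by_topic`, and the topic names and outcomes are all hypothetical and do not come from the benchmark.

```python
from collections import defaultdict

# Hypothetical records: (topic chain from root to leaf, was the answer correct?).
# Topic names and outcomes are illustrative only, not data from MaTT.
results = [
    (("Pure Mathematics", "Algebra", "Group Theory"), True),
    (("Pure Mathematics", "Algebra", "Ring Theory"), False),
    (("Pure Mathematics", "Topology"), True),
    (("Applied Mathematics", "Probability"), False),
]


def accuracy_by_topic(records):
    """Return {topic_prefix: accuracy} for every prefix of every topic chain."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for chain, is_correct in records:
        # A question counts toward its leaf topic and every ancestor topic.
        for depth in range(1, len(chain) + 1):
            node = chain[:depth]
            totals[node] += 1
            correct[node] += int(is_correct)
    return {node: correct[node] / totals[node] for node in totals}


scores = accuracy_by_topic(results)
for node in sorted(scores):
    print(" > ".join(node), f"{scores[node]:.2f}")
```

Keying each node by its full prefix, rather than by its name alone, keeps two subtopics with the same name under different parents from being merged.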

Code & Models

Repositories

arashgholami/MaTT (Official)

Videos

LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs

Taxonomy

Topics: Research Data Management Practices · Scientific Computing and Data Management

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer