MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models

Wentian Wang; Sarthak Jain; Paul Kantor; Jacob Feldman; Lazaros Gallos; Hao Wang

arXiv:2406.15468·cs.CL·October 7, 2024

TL;DR

MMLU-SR is a new benchmark dataset designed to evaluate the true reasoning and comprehension abilities of large language models by testing their performance on modified questions with key terms replaced, revealing gaps in understanding.

Contribution

The paper introduces MMLU-SR, a novel stress-test dataset that challenges LLMs' comprehension by replacing key terms, highlighting limitations of current models.

Findings

01. Recent LLMs' performance drops significantly on MMLU-SR after key term replacement.

02. High scores on standard benchmarks do not guarantee true understanding.

03. MMLU-SR provides a rigorous test for evaluating genuine reasoning capabilities.

Abstract

We propose MMLU-SR, a novel dataset designed to measure the true comprehension abilities of Large Language Models (LLMs) by challenging their performance in question-answering tasks with modified terms. We reasoned that an agent that "truly" understands a concept can still evaluate it when key terms are replaced by suitably defined alternate terms, and sought to differentiate such comprehension from mere text replacement. In our study, we modified standardized test questions by replacing a key term with a dummy word along with its definition. The key term could be in the context of questions, answers, or both questions and answers. Notwithstanding the high scores achieved by recent popular LLMs on the MMLU leaderboard, we found a substantial reduction in model performance after such replacement, suggesting poor comprehension. This new benchmark provides a rigorous benchmark for testing…
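The replacement procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the `MCQuestion` class, the `make_variant` function, the placeholder word `Dummy`, the "Suppose ... means ..." wording of the definition, and the sample question are all assumptions made for illustration. The sketch only shows the three cases the abstract names, where the key term is replaced in the question, in the answers, or in both.

```python
import re
from dataclasses import dataclass


@dataclass
class MCQuestion:
    """Hypothetical container for one multiple-choice item."""
    question: str
    choices: list[str]
    answer_index: int


def replace_term(text: str, term: str, dummy: str) -> str:
    """Replace whole-word, case-insensitive occurrences of `term` with `dummy`."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", flags=re.IGNORECASE)
    # A callable replacement keeps `dummy` literal (no backslash escapes).
    return pattern.sub(lambda _match: dummy, text)


def make_variant(item: MCQuestion, term: str, definition: str,
                 dummy: str = "Dummy", scope: str = "both") -> MCQuestion:
    """Build a modified item; `scope` is 'question', 'answer', or 'both'.

    The definition sentence format is an assumption, not taken from the paper.
    """
    if scope not in {"question", "answer", "both"}:
        raise ValueError(f"unknown scope: {scope!r}")

    question = item.question
    choices = list(item.choices)
    if scope in {"question", "both"}:
        question = replace_term(question, term, dummy)
    if scope in {"answer", "both"}:
        choices = [replace_term(choice, term, dummy) for choice in choices]

    # Only state the definition if a replacement actually happened.
    if question != item.question or choices != item.choices:
        question = f"Suppose '{dummy}' means '{definition}'. {question}"
    return MCQuestion(question, choices, item.answer_index)


# Made-up sample item, used only to exercise the sketch.
item = MCQuestion(
    question="Which statement about a prime number is true?",
    choices=[
        "Every prime number is odd",
        "A prime number has exactly two positive divisors",
        "Zero is a prime number",
        "A prime number is always even",
    ],
    answer_index=1,
)
definition = "an integer greater than 1 divisible only by 1 and itself"

for scope in ("question", "answer", "both"):
    variant = make_variant(item, "prime number", definition, scope=scope)
    print(scope, "->", variant.question, variant.choices)
```

The correct answer index is left unchanged in every variant, so a model's accuracy on the original and modified items can be compared directly.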

Code & Models

Datasets

NiniCat/MMLU-SR
dataset · 471 downloads

Taxonomy

Topics: Topic Modeling