MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models
Wentian Wang, Sarthak Jain, Paul Kantor, Jacob Feldman, Lazaros Gallos, Hao Wang

TL;DR
MMLU-SR is a new benchmark dataset designed to evaluate the true reasoning and comprehension abilities of large language models by testing their performance on modified questions with key terms replaced, revealing gaps in understanding.
Contribution
The paper introduces MMLU-SR, a novel stress-test dataset that challenges LLMs' comprehension by replacing key terms, highlighting limitations of current models.
Findings
Recent LLMs' performance drops significantly on MMLU-SR after key term replacement.
High scores on standard benchmarks do not guarantee true understanding.
MMLU-SR provides a rigorous test for evaluating genuine reasoning capabilities.
Abstract
We propose MMLU-SR, a novel dataset designed to measure the true comprehension abilities of Large Language Models (LLMs) by challenging their performance in question-answering tasks with modified terms. We reasoned that an agent that "truly" understands a concept can still evaluate it when key terms are replaced by suitably defined alternate terms, and sought to differentiate such comprehension from mere text replacement. In our study, we modified standardized test questions by replacing a key term with a dummy word along with its definition. The key term could be in the context of questions, answers, or both questions and answers. Notwithstanding the high scores achieved by recent popular LLMs on the MMLU leaderboard, we found a substantial reduction in model performance after such replacement, suggesting poor comprehension. This new benchmark provides a rigorous benchmark for testing…
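The replacement procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the `mode` values, the placeholder word "Dummy", the wording of the definition preamble, and the sample question are all assumptions made for illustration.

```python
import re


def replace_key_term(text, term, dummy="Dummy"):
    """Swap whole-word occurrences of `term` for `dummy`, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", flags=re.IGNORECASE)
    return pattern.sub(dummy, text)


def make_variant(question, choices, term, definition, mode="both", dummy="Dummy"):
    """Build a modified multiple-choice item.

    mode: "question" replaces the term only in the question,
          "answer" only in the answer choices,
          "both" in question and answer choices.
    The preamble wording is an assumed format, not taken from the paper.
    """
    if mode not in ("question", "answer", "both"):
        raise ValueError(f"unknown mode: {mode!r}")

    # The definition is always supplied so the dummy word is well defined.
    preamble = f"Suppose '{dummy}' means '{definition}'."

    new_question = question
    if mode in ("question", "both"):
        new_question = replace_key_term(question, term, dummy)

    new_choices = list(choices)
    if mode in ("answer", "both"):
        new_choices = [replace_key_term(c, term, dummy) for c in choices]

    return f"{preamble} {new_question}", new_choices


if __name__ == "__main__":
    # Hypothetical item, written for this example only.
    q, opts = make_variant(
        question="Which organelle is the main site of ATP production in a cell?",
        choices=["Nucleus", "Mitochondrion", "Ribosome", "Cell wall"],
        term="cell",
        definition="the basic structural unit of living organisms",
        mode="both",
    )
    print(q)
    print(opts)
```

A model that understands the concept should answer the rewritten item as well as the original one, since the preamble supplies everything needed to map the dummy word back to its meaning.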
Taxonomy
Topics: Topic Modeling
