Are Large Language Models Truly Smarter Than Humans?
Eshwar Reddy M, Sourav Karmakar

TL;DR
This paper audits six frontier large language models for benchmark contamination, finding that a measurable share of benchmark questions shows signs of contamination and estimating the resulting accuracy gains by subject category.
Contribution
It introduces three complementary experiments to detect data contamination in LLMs, provides a detailed contamination ranking, and analyzes the effect of contamination on model performance.
Findings
13.8% overall contamination rate across 513 MMLU questions (18.1% in STEM, up to 66.7% in Philosophy)
Estimated performance gains of +0.030 to +0.054 accuracy points by category due to contamination
72.5% of models show memorization signals above chance
Abstract
Public leaderboards increasingly suggest that large language models (LLMs) surpass human experts on benchmarks spanning academic knowledge, law, and programming. Yet most benchmarks are fully public, their questions widely mirrored across the internet, creating systematic risk that models were trained on the very data used to evaluate them. This paper presents three complementary experiments forming a rigorous multi-method contamination audit of six frontier LLMs: GPT-4o, GPT-4o-mini, DeepSeek-R1, DeepSeek-V3, Llama-3.3-70B, and Qwen3-235B. Experiment 1 applies a lexical contamination detection pipeline to 513 MMLU questions across all 57 subjects, finding an overall contamination rate of 13.8% (18.1% in STEM, up to 66.7% in Philosophy) and estimated performance gains of +0.030 to +0.054 accuracy points by category. Experiment 2 applies a paraphrase and indirect-reference diagnostic to…
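As a rough illustration of what a lexical contamination check and the associated accuracy-gain estimate might look like, here is a minimal Python sketch. It is not the authors' pipeline, whose details are not given in this excerpt. The whitespace tokenization, the n-gram length `n`, the `threshold`, and all function names are assumptions chosen for illustration.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text` (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(question: str, doc: str, n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in `doc`."""
    q = ngrams(question, n)
    if not q:
        return 0.0  # question shorter than n tokens: nothing to compare
    return len(q & ngrams(doc, n)) / len(q)


def is_contaminated(question: str, corpus: Sequence[str],
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a question if any reference document reaches the overlap threshold."""
    return any(overlap_ratio(question, doc, n) >= threshold for doc in corpus)


def contamination_rate(questions: Sequence[str], corpus: Sequence[str],
                       n: int = 8, threshold: float = 0.5) -> float:
    """Share of questions flagged as contaminated."""
    if not questions:
        return 0.0
    flags = [is_contaminated(q, corpus, n, threshold) for q in questions]
    return sum(flags) / len(flags)


def accuracy_gap(flags: List[bool], correct: List[bool]) -> float:
    """Accuracy on flagged questions minus accuracy on unflagged ones.

    A naive proxy for a contamination-driven gain; it ignores question
    difficulty and other confounders.
    """
    hit = [c for f, c in zip(flags, correct) if f]
    clean = [c for f, c in zip(flags, correct) if not f]
    if not hit or not clean:
        raise ValueError("need at least one flagged and one unflagged question")
    return sum(hit) / len(hit) - sum(clean) / len(clean)
```

A call such as `contamination_rate(questions, corpus)` gives a rate comparable in kind to the figures reported above. The value depends heavily on the choice of `n`, the threshold, and the reference corpus searched.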
Taxonomy
Topics: Computational and Text Analysis Methods · Artificial Intelligence in Healthcare and Education · Topic Modeling
