Forget What You Know about LLMs Evaluations -- LLMs are Like a Chameleon
Nurit Cohen-Inger, Yehonatan Elisha, Bracha Shapira, Lior Rokach, Seffi Cohen

TL;DR
This paper introduces C-BOD, a framework that detects overfitting in LLMs by perturbing prompts, revealing that many models rely on superficial cues rather than true understanding, thus challenging current evaluation practices.
Contribution
The paper presents C-BOD, a novel meta-evaluation method that systematically distorts benchmark prompts to assess LLM robustness and overfitting, highlighting limitations of current benchmarks.
Findings
Models with higher baseline accuracy are more sensitive to prompt perturbations.
Larger LLMs tend to over-rely on fixed prompt patterns.
Most models show statistically significant performance differences under prompt rephrasing (20 of the 26 evaluated on MMLU).
Abstract
Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues rather than true language understanding. We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that systematically distorts benchmark prompts via a parametric transformation and detects overfitting of LLMs. By rephrasing inputs while preserving their semantic content and labels, C-BOD exposes whether a model's performance is driven by memorized patterns. Evaluated on the MMLU benchmark using 26 leading LLMs, our method reveals an average performance degradation of 2.15% under modest perturbations, with 20 out of 26 models exhibiting statistically significant differences. Notably, models with higher baseline accuracy exhibit larger performance differences under perturbation, and larger LLMs tend to be…
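The abstract describes comparing each model on original and rephrased prompts and checking whether the difference is statistically significant. The sketch below illustrates one way such a paired comparison could be done. It assumes an exact McNemar test on per-question correctness flags; the abstract does not say which test C-BOD uses, and the function names `mcnemar_exact` and `accuracy_drop` are hypothetical, not taken from the paper.

```python
from math import comb


def mcnemar_exact(orig_correct, pert_correct):
    """Two-sided exact McNemar test on paired correctness flags.

    Assumption: this is an illustrative choice of paired test,
    not necessarily the one used by the paper.
    Returns (b, c, p_value), where b counts questions answered
    correctly only on the original prompt and c counts questions
    answered correctly only on the perturbed prompt.
    """
    if len(orig_correct) != len(pert_correct):
        raise ValueError("paired lists must have equal length")
    pairs = list(zip(orig_correct, pert_correct))
    b = sum(1 for o, p in pairs if o and not p)
    c = sum(1 for o, p in pairs if p and not o)
    n = b + c
    if n == 0:
        # No discordant pairs: no evidence of any difference.
        return b, c, 1.0
    # Under the null hypothesis, discordant pairs split 50/50.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


def accuracy_drop(orig_correct, pert_correct):
    """Accuracy on original prompts minus accuracy on perturbed prompts."""
    n = len(orig_correct)
    return (sum(orig_correct) - sum(pert_correct)) / n


if __name__ == "__main__":
    # Toy data (hypothetical): 10 questions, 8 lost after rephrasing.
    original = [True] * 10
    perturbed = [True] * 2 + [False] * 8
    b, c, p = mcnemar_exact(original, perturbed)
    print(f"drop={accuracy_drop(original, perturbed):.2f} b={b} c={c} p={p:.4f}")
```

A paired test fits here because each rephrased prompt keeps the same label as its original, so only the questions where the two outcomes disagree carry information about sensitivity to rephrasing.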
Taxonomy
Topics: Library Science and Information Systems
Methods: LLaMA
