Forget What You Know about LLMs Evaluations -- LLMs are Like a Chameleon

Nurit Cohen-Inger; Yehonatan Elisha; Bracha Shapira; Lior Rokach; Seffi Cohen

arXiv:2502.07445 · cs.CL · September 18, 2025

TL;DR

This paper introduces C-BOD, a framework that detects benchmark overfitting in LLMs by rephrasing prompts while preserving their meaning. It finds that many models rely on superficial cues rather than true understanding, which challenges current evaluation practices.

Contribution

The paper presents C-BOD, a meta-evaluation method that systematically distorts benchmark prompts to assess LLM robustness and detect overfitting, exposing limitations of current benchmarks.

Findings

01. Models with higher baseline accuracy are more sensitive to prompt perturbations.

02. Larger LLMs tend to over-rely on fixed prompt patterns.

03. Under prompt rephrasing on MMLU, 20 of 26 models show statistically significant performance differences, with an average degradation of 2.15%.

Abstract

Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues rather than true language understanding. We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that systematically distorts benchmark prompts via a parametric transformation and detects overfitting of LLMs. By rephrasing inputs while preserving their semantic content and labels, C-BOD exposes whether a model's performance is driven by memorized patterns. Evaluated on the MMLU benchmark using 26 leading LLMs, our method reveals an average performance degradation of 2.15% under modest perturbations, with 20 out of 26 models exhibiting statistically significant differences. Notably, models with higher baseline accuracy exhibit larger performance differences under perturbation, and larger LLMs tend to be…
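The comparison the abstract describes, scoring the same model on original and rephrased prompts and checking whether the gap is statistically significant, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes per-question correctness flags are already available, and since the text here does not name the significance test, the exact McNemar test on paired outcomes is an assumption. The function names `accuracy` and `mcnemar_exact` and the toy data are hypothetical.

```python
from math import comb


def accuracy(correct):
    """Fraction of questions answered correctly."""
    return sum(correct) / len(correct)


def mcnemar_exact(orig_correct, pert_correct):
    """Two-sided exact McNemar test on paired per-question correctness.

    Assumed test choice: the source text does not say which test is used.
    Returns (b, c, p_value), where
      b = questions right on the original prompt but wrong on the rephrased one,
      c = questions wrong on the original prompt but right on the rephrased one.
    """
    if len(orig_correct) != len(pert_correct):
        raise ValueError("both runs must cover the same questions")
    b = sum(1 for o, p in zip(orig_correct, pert_correct) if o and not p)
    c = sum(1 for o, p in zip(orig_correct, pert_correct) if p and not o)
    n = b + c
    if n == 0:
        # No answer changed, so there is no evidence of a difference.
        return b, c, 1.0
    # Under the null hypothesis, each flip is equally likely in either direction,
    # so min(b, c) follows a Binomial(n, 0.5) tail.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (hypothetical): 10 questions, same order in both runs.
orig = [True] * 8 + [False] * 2
pert = [True] * 3 + [False] * 6 + [True]

delta = accuracy(orig) - accuracy(pert)
b, c, p_value = mcnemar_exact(orig, pert)
print(f"accuracy drop: {delta:.2f}, flips: {b} vs {c}, p = {p_value}")
```

On this toy data accuracy falls from 0.8 to 0.4, with 5 answers flipping from right to wrong and 1 the other way. The exact two-sided p-value is 0.21875, so a sample this small would not count as significant even with a large drop; a paired test is used because both runs answer the same questions.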

Code & Models

Repositories

SeffiCohen/CBOD (PyTorch, official)

Datasets

seffico/Rephrased_MMLU
dataset · 9 downloads

Videos

Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon

Taxonomy

Topics: Library Science and Information Systems

Methods: LLaMA