Towards Contamination Resistant Benchmarks

Rahmatullah Musawi; Sheng Lu

arXiv:2505.08389·cs.CL·May 14, 2025

TL;DR

This paper introduces a contamination resistant benchmark based on Caesar ciphers for evaluating large language models more reliably. Widely used LLMs struggle on it when contamination is controlled, which points to the need for more rigorous assessment methods.

Contribution

The paper proposes a novel contamination resistant benchmark based on Caesar ciphers and shows that it exposes limitations of current LLMs.

Findings

01

LLMs struggle with the Caesar cipher benchmark when contamination is controlled

02

Current evaluation methods may overestimate LLM capabilities

03

The benchmark highlights the need for more robust evaluation techniques

Abstract

The rapid development of large language models (LLMs) has transformed the landscape of natural language processing. Evaluating LLMs properly is crucial for understanding their potential and addressing concerns such as safety. However, LLM evaluation is confronted by various factors, among which contamination stands out as a key issue that undermines the reliability of evaluations. In this work, we introduce the concept of contamination resistance to address this challenge. We propose a benchmark based on Caesar ciphers (e.g., "ab" to "bc" when the shift is 1), which, despite its simplicity, is an excellent example of a contamination resistant benchmark. We test this benchmark on widely used LLMs under various settings, and we find that these models struggle with this benchmark when contamination is controlled. Our findings reveal issues in current LLMs and raise important questions…
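
To make the abstract's example concrete ("ab" to "bc" when the shift is 1), here is a minimal sketch of a Caesar cipher in Python. The function name `caesar_shift` and the choice to leave non-letter characters unchanged are assumptions made for illustration; the abstract does not specify how the authors implement or prompt the task.

```python
import string


def caesar_shift(text: str, shift: int) -> str:
    """Shift each ASCII letter by `shift` positions, wrapping around the alphabet.

    Hypothetical helper for illustration only. Non-letter characters are
    passed through unchanged, which is an assumption, not the paper's spec.
    """
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            base = ord("a")
        elif ch in string.ascii_uppercase:
            base = ord("A")
        else:
            out.append(ch)  # leave digits, spaces, punctuation as they are
            continue
        # Map the letter to 0..25, add the shift, wrap with modulo 26.
        out.append(chr((ord(ch) - base + shift) % 26 + base))
    return "".join(out)


if __name__ == "__main__":
    print(caesar_shift("ab", 1))   # bc (the abstract's example)
    print(caesar_shift("bc", -1))  # ab (decoding is a shift in the other direction)
```

Decoding uses the same function with the negated shift, so a single routine covers both directions of the task.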
