Towards Contamination Resistant Benchmarks
Rahmatullah Musawi, Sheng Lu

TL;DR
This paper introduces a contamination-resistant benchmark based on Caesar ciphers for evaluating large language models (LLMs) more reliably. The tested models struggle with it when contamination is controlled, which points to the need for more rigorous assessment methods.
Contribution
The paper proposes a novel contamination-resistant benchmark based on Caesar ciphers and demonstrates its effectiveness in revealing LLM limitations.
Findings
LLMs struggle with the Caesar cipher benchmark when contamination is controlled
Current evaluation methods may overestimate LLM capabilities
The benchmark highlights the need for more robust evaluation techniques
Abstract
The rapid development of large language models (LLMs) has transformed the landscape of natural language processing. Evaluating LLMs properly is crucial for understanding their potential and addressing concerns such as safety. However, LLM evaluation is confronted by various factors, among which contamination stands out as a key issue that undermines the reliability of evaluations. In this work, we introduce the concept of contamination resistance to address this challenge. We propose a benchmark based on Caesar ciphers (e.g., "ab" to "bc" when the shift is 1), which, despite its simplicity, is an excellent example of a contamination resistant benchmark. We test this benchmark on widely used LLMs under various settings, and we find that these models struggle with this benchmark when contamination is controlled. Our findings reveal issues in current LLMs and raise important questions…
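The abstract's example ("ab" to "bc" when the shift is 1) can be illustrated with a minimal sketch of a Caesar cipher. This is not the authors' benchmark code: the function name `caesar_shift` and the choices to handle only ASCII letters, preserve case, and leave other characters unchanged are assumptions made for illustration.

```python
import string


def caesar_shift(text: str, shift: int) -> str:
    """Shift each ASCII letter by `shift` positions, wrapping around the alphabet.

    Hypothetical helper for illustration; non-letter characters are left as-is.
    """
    result = []
    for ch in text:
        if ch in string.ascii_lowercase:
            # Map 'a'..'z' to 0..25, shift modulo 26, map back.
            result.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            result.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            result.append(ch)
    return "".join(result)


encoded = caesar_shift("ab", 1)      # the abstract's example: shift of 1
decoded = caesar_shift(encoded, -1)  # decoding is a shift in the opposite direction
print(encoded)  # bc
print(decoded)  # ab
```

Because the plaintext and the shift can both be chosen freely, new encode and decode instances are cheap to generate, which is one plausible reading of why such a simple task suits a contamination-resistant benchmark.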
