Provable Joint Decontamination for Benchmarking Multiple Large Language Models
Zhenlong Liu, Hao Zeng, Hongxin Wei

TL;DR
This paper introduces Joint Envelope Conformal Selection (JECS), a conformal method for jointly decontaminating benchmark data across multiple large language models. It supports fair cross-model evaluation by controlling the global contamination rate with theoretical guarantees.
Contribution
The paper formalizes multi-model benchmark decontamination as a joint selection problem and proposes JECS, a conformal procedure with provable global contamination rate (GCR) control.
Findings
JECS achieves higher power than baseline methods.
JECS maintains target contamination rate control.
Extensive experiments validate JECS's effectiveness.
Abstract
Benchmark data contamination has become a central challenge in LLM evaluation: when evaluation examples appear in the training data of one or more audited models, reported performance can be inflated and cross-model comparisons become unreliable. A broad line of training-data detection work designs scores to quantify how strongly a model memorizes a given data point, but these score-based methods lack theoretical guarantees. Recent conformal approaches provide provable false-identification control for a single model; however, applying them separately to each model can produce model-specific benchmarks, undermining fair comparison across models. In this work, we formalize multi-model benchmark decontamination as a joint selection problem and propose Joint Envelope Conformal Selection (JECS), a conformal procedure that enables global contamination rate (GCR) control under stated…
