How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

Takashi Ishida; Thanawat Lodkaew; Ikko Yamane

arXiv:2505.18102·cs.LG·October 7, 2025

TL;DR

This paper introduces a method to publish LLM benchmarks without revealing true answers by injecting randomness, enabling open evaluation while detecting data contamination and preventing test-set overfitting.

Contribution

The authors propose a novel approach to publish benchmarks with hidden ground truths using answer randomness, balancing transparency and data protection.

Findings

01

Method effectively detects data contamination across benchmarks.

02

Randomized answers prevent models from surpassing Bayes accuracy.

03

Approach maintains open evaluation without revealing true answers.
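Finding 02 suggests a simple statistical check: if randomization caps the best achievable accuracy at a known value, a score significantly above that cap is evidence of contamination. The sketch below is one way such a check could look, not the authors' procedure. It assumes a cap of 0.5, independent test items, and a one-sided exact binomial test; the function names `binomial_tail` and `looks_contaminated` and the threshold `alpha` are hypothetical.

```python
from math import exp, lgamma, log


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Terms are computed in log space so that large n does not overflow.
    """
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p)
        )
        total += exp(log_term)
    return total


def looks_contaminated(correct: int, total: int,
                       bayes_accuracy: float = 0.5,
                       alpha: float = 0.01) -> bool:
    """Flag a model whose score is implausibly far above the assumed cap.

    bayes_accuracy and alpha are illustrative defaults, not values
    taken from the paper.
    """
    return binomial_tail(correct, total, bayes_accuracy) < alpha


# A model scoring 90/100 on a benchmark capped at 50% is flagged;
# a model scoring 50/100 is not.
flag_high = looks_contaminated(90, 100)
flag_at_cap = looks_contaminated(50, 100)
```

The test is one-sided because only scores above the cap are suspicious; a score below it is consistent with an uncontaminated but imperfect model.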

Abstract

Publishing a large language model (LLM) benchmark on the Internet risks contaminating future LLMs: the benchmark may be unintentionally (or intentionally) used to train or select a model. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers. However, this strategy requires trust in a single organization and still permits test-set overfitting through repeated queries. To overcome this issue, we propose a way to publish benchmarks without completely disclosing the ground-truth answers to the questions, while still maintaining the ability to openly evaluate LLMs. The main underlying idea is to reduce the best possible accuracy, i.e., Bayes accuracy, by injecting randomness into the answers: we prepare several logically correct answers and include only one of them as the solution in the benchmark. Not only is…
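The randomization idea in the abstract can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the authors' implementation: it assumes arithmetic questions, a fair coin that adds 1 to the true answer (in the spirit of the "3x6" to "19" case mentioned in the reviews), and a hypothetical instruction string appended to each question. The names `cap_item` and `score` are invented for this example.

```python
import random


def cap_item(question: str, true_answer: int, rng: random.Random) -> dict:
    """Build a capped benchmark item (hypothetical format).

    The published answer is true_answer or true_answer + 1, each with
    probability 1/2, so no predictor can expect more than 50% accuracy
    without having seen the realized published answers.
    """
    offset = rng.randint(0, 1)  # benchmark creator's private coin flip
    return {
        # Illustrative instruction wording, not taken from the paper.
        "question": f"{question} Then flip a fair coin and add 1 if it lands heads.",
        "published_answer": true_answer + offset,
        "true_answer": true_answer,  # would stay private in practice
    }


def score(predictions: list, items: list) -> float:
    """Fraction of predictions that match the published answers."""
    hits = sum(p == it["published_answer"] for p, it in zip(predictions, items))
    return hits / len(items)


rng = random.Random(0)
items = [
    cap_item(f"What is {a} x {b}?", a * b, rng)
    for a in range(2, 52)
    for b in range(2, 42)
]  # 2000 items

# An uncontaminated model that knows every true answer and follows the
# instruction with its own independent coin.
clean_predictions = [it["true_answer"] + rng.randint(0, 1) for it in items]

# A contaminated model that has memorized the published answers.
memorized_predictions = [it["published_answer"] for it in items]

clean_acc = score(clean_predictions, items)          # close to 0.5
memorized_acc = score(memorized_predictions, items)  # exactly 1.0
```

In this toy setup a perfect but uncontaminated model lands near the 50% cap, while a model that memorized the published answers exceeds it, which is the signal used to detect contamination.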

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 3

Strengths

* The method is straightforward and intuitive.
* They compare their method to reasonable baselines like canary string and min-k%. I agree with their comparisons with these baselines, and with how their method avoids some of the baselines' shortcomings.
* The paper is fleshed out and well-detailed. I appreciate how the questions asked in the red paragraph headers were answered, and in particular I found myself asking a lot of those questions before finding them already answered.

Weaknesses

* Most notably, this method must be applied to the benchmark _before_ it is released. This limits its utility, as it requires benchmark creators to adopt this method and only release the CapBenched version. Given that pre-release adoption is a necessary condition for this method to work, the following weaknesses are then considered because they may hurt adoption:
  * First, it is not clear whether the true accuracy can be well estimated if the benchmark is capped (See clarification questions label

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- the authors empirically show that their method (CapBencher) detects contamination more reliably than SoTA detection baselines.
- a wide range of benchmarks, models, and baselines.
- the strategies are simple and can be easily implemented on several benchmarks by string concatenation.
- the presented method improves over the contamination detection comparators.

Weaknesses

- the practical strategy "Disclosure allowed" seems weak (as acknowledged by the authors) and lacks novelty. It is simple prompt injection.
- both practical strategies are easily circumventable.
- the paper does not have a Limitations section.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The primary strength is the originality of the core idea. Instead of treating contamination as a post-hoc detection problem, CapBencher integrates the detection mechanism into the benchmark's design. Using the Bayes accuracy as a statistical "ceiling" is an elegant and principled solution.
2. The paper does not just rely on empirical results. Theorem 1, which proves the affine relationship between capped and original scores, provides a strong theoretical foundation. This guarantee, that mode

Weaknesses

My main concerns are that the method, while clever, may be brittle against a non-naive adversary and that its central premise of "concealment" is overstated.

1. The paper's detection mechanism hinges on the assumption that a contaminated model will "try to memorize the realized values" (e.g., memorize "19" for the "3x6" question). This seems like a naive threat model. A more plausible attack, especially if this method were adopted widely, would be for the model to learn the randomization rule

Code & Models

Datasets

ishidalab/capbencher
dataset · 10 dl
