LogProber: Disentangling confidence from contamination in LLM responses

Nicolas Yax; Pierre-Yves Oudeyer; Stefano Palminteri

arXiv:2408.14352·cs.CL·June 23, 2025

Open Access · 3 Reviews

TL;DR

LogProber is a new algorithm designed to detect data contamination in large language models by assessing question familiarity, improving fairness in performance evaluation of these models.

Contribution

It introduces a novel, efficient contamination detection method focusing on question familiarity, addressing limitations of previous approaches in black box settings.

Findings

01

LogProber effectively detects contamination in LLMs.

02

It outperforms some existing methods in certain scenarios.

03

The method has limitations depending on contamination types.

Abstract

In machine learning, contamination refers to situations where testing data leak into the training set. The issue is particularly relevant for the evaluation of the performance of Large Language Models (LLMs), which are generally trained on gargantuan, and generally opaque, corpora of text scraped from the world wide web. Developing tools to detect contamination is therefore crucial to be able to fairly and properly track the evolution of the performance of LLMs. To date, only a few recent studies have attempted to address the issue of quantifying and detecting contamination in short text sequences, such as those commonly found in benchmarks. However, these methods have limitations that can sometimes render them impractical. In the present paper, we introduce LogProber, a novel, efficient algorithm that we show to be able to detect contamination in a black box setting that tries to…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

1. Solves the "Confidence" Flaw: The paper's main strength is identifying that existing detectors mistake a model's high confidence for contamination. Its novel solution is to analyze the question text instead of the answer, successfully disentangling genuine skill from memorization.

2. High Transparency: The authors are rigorous and transparent about the tool's limitations. They explicitly demonstrate that LogProber is blind to "answer-only" (-A) contamination, which is a common format for fi

Weaknesses

1. The writing can be improved: the introductory content is too long and the citation format can be further improved. Most importantly, the paper would benefit from adding a conclusion section and a related work section. These two are clearly missing.

2. The Llama-1-7B model used in the experiment is too old, and the baselines are too few and not strong enough. It only compared with CDD, while data contamination detection is not a new topic and there is a lot of existing work defining and addre

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The paper addresses a fundamental, high-impact problem. As models become more powerful, their performance on standard benchmarks is increasingly scrutinized for contamination. This work provides a practical tool to help maintain the integrity of LLM evaluation.

Weaknesses

- The paper introduces a specific, non-trivial formula for the "Safe Score" based on the integral of the sorted cumulative log-probabilities (Equation 1). However, no justification is provided for why this specific formulation is optimal, or even necessary, compared to simpler, more direct statistical measures of the "flatness" of the $\log(p)$ curve. For instance, what about the simple variance of the $\log(p)$ values? Or the 10th percentile of $\log(p)$? A contaminated sequence should have
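The two simpler statistics Reviewer 02 mentions can be sketched as follows. This is a minimal illustration, not LogProber's Safe Score and not code from the paper: the function name `flatness_stats` and the example log-probability values are hypothetical, and the premise that memorized text yields per-token log-probabilities close to zero is an assumption made here for illustration.

```python
import statistics


def flatness_stats(token_logprobs):
    """Return (variance, 10th percentile) of per-token log-probabilities.

    Hypothetical helper: these are the simple baselines the reviewer asks
    about, not the paper's Safe Score.
    """
    if len(token_logprobs) < 2:
        raise ValueError("need at least two token log-probabilities")
    # Population variance of the log(p) values.
    variance = statistics.pvariance(token_logprobs)
    # quantiles(n=10) returns the 9 decile cut points; the first one is
    # the 10th percentile.
    p10 = statistics.quantiles(token_logprobs, n=10, method="inclusive")[0]
    return variance, p10


# Made-up values for illustration only, not measurements from the paper.
memorized_like = [-0.05, -0.02, -0.01, -0.03, -0.02, -0.01]
novel_like = [-3.2, -0.4, -5.1, -1.7, -0.2, -2.9]

var_mem, p10_mem = flatness_stats(memorized_like)
var_new, p10_new = flatness_stats(novel_like)
print(f"memorized-like: variance={var_mem:.4f}, p10={p10_mem:.2f}")
print(f"novel-like:     variance={var_new:.4f}, p10={p10_new:.2f}")
```

Under the stated assumption, the memorized-like sequence shows a lower variance and a higher 10th percentile than the novel-like one. Whether either statistic separates contaminated from clean questions as well as the Safe Score is exactly the open question the reviewer raises.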

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. Important problem: data contamination is still a difficult and important problem to solve.

2. The paper explains its key ideas very clearly.

Weaknesses

1. Lack of innovation: there is already a wide range of confidence/logP-based scores [1], and people have already figured out that rephrasing escapes those detection methods [2, 3]. This method lacks merit in advancing the field.

2. The model and dataset used are too simple. Only one set of experiments (CRT / Llama-1) is done to show effectiveness.

3. Lack of analysis. Does question length have an effect here? What about CoT models?

[1] Zhang, Huixuan, Yun Lin, and Xiaojun Wan. "Pacost: Paired con

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques