# Test of Time: Rethinking Temporal Signal of Benchmark Contamination

**Authors:** Terry Jingchen Zhang, Gopal Dev, Ning Wang, Max Obreiter, Punya Syon Pandey, Keenan Samway, Wenyuan Jiang, Yinya Huang, Bernhard Schölkopf, Mrinmaya Sachan, Zhijing Jin

arXiv: 2509.00072 · 2026-05-14

## TL;DR

This paper challenges the assumption that post-cutoff performance decay in LLMs indicates benchmark contamination, showing that question construction methods significantly influence observed temporal signals.

## Contribution

The paper shows that the temporal signal of contamination is highly sensitive to how benchmark questions are formulated, even when the source material is unchanged, and it motivates more robust contamination probes for LLM evaluation.

## Key findings

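- LLM-transformed questions show markedly different temporal patterns from fill-in-the-blank (cloze) questions retrieved from the same documents.
- On LiveCodeBench, a prior benchmark that reports clear post-cutoff decay, a simple LLM-driven transformation of the same problems effectively removes the temporal pattern.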
- Influence function analysis helps explain the mechanistic basis of the phenomenon.

## Abstract

Post-cutoff performance decay of LLMs has been widely interpreted as a temporal signal for benchmark contamination, where public information released before the training cutoff may have been included in training corpora and inflated model performance by memorization. We critically examine this view and demonstrate that this temporal signal is highly sensitive to how benchmark questions are constructed, even if the underlying source material remains invariant. Specifically, we show that LLM-transformed questions can produce remarkably different temporal patterns compared to fill-in-the-blank (cloze) questions directly retrieved from the very same documents. We validate this effect on prior benchmarks that report clear post-cutoff decay (LiveCodeBench), and show that a simple LLM-driven transformation of the same problems can effectively remove the temporal pattern. We further provide a mechanistic understanding of this phenomenon using influence function analysis. Overall, our results suggest that post-cutoff performance decay is a sensitive contamination signal, motivating more robust contamination probes for reliable LLM evaluation.
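To make the "temporal signal" concrete, the sketch below computes one simple version of it: mean accuracy on items released before a training cutoff minus mean accuracy on items released after it. This is an illustrative assumption, not the paper's actual metric or pipeline. The function name `decay_gap`, the record layout, the cutoff date, and all data points are hypothetical and made up for the example.

```python
from datetime import date
from statistics import mean


def decay_gap(records, cutoff):
    """Pre-cutoff mean accuracy minus post-cutoff mean accuracy.

    `records` is a list of (release_date, is_correct) pairs.
    A large positive gap is what is usually read as a contamination signal.
    """
    pre = [float(ok) for released, ok in records if released < cutoff]
    post = [float(ok) for released, ok in records if released >= cutoff]
    if not pre or not post:
        raise ValueError("need at least one item on each side of the cutoff")
    return mean(pre) - mean(post)


# Hypothetical cutoff and toy results; none of these numbers come from the paper.
cutoff = date(2024, 1, 1)

# Same source documents, two question formats (assumed for illustration).
cloze = [
    (date(2023, 6, 1), True),
    (date(2023, 9, 1), True),
    (date(2024, 3, 1), False),
    (date(2024, 6, 1), True),
]
transformed = [
    (date(2023, 6, 1), True),
    (date(2023, 9, 1), False),
    (date(2024, 3, 1), True),
    (date(2024, 6, 1), False),
]

print("cloze gap:", decay_gap(cloze, cutoff))
print("transformed gap:", decay_gap(transformed, cutoff))
```

In this toy setup the cloze format shows a gap while the transformed format does not, which mirrors the kind of format sensitivity the abstract describes. A real analysis would need many items per time bin and some measure of uncertainty around the gap.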

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00072/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00072/full.md

## References

37 references — full list in the complete paper: https://tomesphere.com/paper/2509.00072/full.md

---
Source: https://tomesphere.com/paper/2509.00072