TL;DR
This paper introduces a hierarchical, multi-agent framework called Cross-Context Verification (CCV) for detecting benchmark contamination in large language models, achieving perfect separation between contaminated and genuine reasoning.
Contribution
It presents a novel black-box, multi-session detection method and a hierarchical analysis architecture that significantly improves contamination detection accuracy over existing approaches.
Findings
Contamination is binary: models either recall perfectly or not at all.
Reasoning absence is a perfect discriminator.
33% of prior contamination labels are false positives.
Abstract
LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods (paraphrase consistency, n-gram overlap, perplexity analysis) never directly observe whether a model reasons or recalls. Meanwhile, simply repeating verification degrades accuracy: multi-turn review generates false positives faster than it discovers true errors, suggesting that structural approaches are needed. We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that prevents confirmation bias through intentional information restriction across specialized analytical roles. On 9 SWE-bench Verified problems (45…
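The "measure solution diversity" step can be pictured with a minimal sketch. The excerpt above does not say how the paper computes diversity, so the metric here (one minus the mean pairwise `difflib` similarity) and the function name `solution_diversity` are assumptions for illustration only. Collecting the N solutions from independent sessions is left out.

```python
from difflib import SequenceMatcher
from itertools import combinations


def solution_diversity(solutions):
    """Return 1 - mean pairwise text similarity across solutions.

    Hypothetical metric, not the paper's: 0.0 means every session
    produced the same text; values near 1.0 mean little overlap.
    """
    if len(solutions) < 2:
        raise ValueError("need at least two sessions to compare")
    ratios = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(solutions, 2)
    ]
    return 1.0 - sum(ratios) / len(ratios)


# Toy inputs standing in for patches from N = 5 independent sessions.
identical = ["return a + b"] * 5
varied = [
    "return a + b",
    "return sum((a, b))",
    "total = a\ntotal += b\nreturn total",
    "return b + a",
    "result = a.__add__(b)\nreturn result",
]

print(solution_diversity(identical))
print(solution_diversity(varied))
```

Under this toy metric, five identical outputs score exactly zero, the pattern one would expect from verbatim recall, while independently derived solutions score above zero. Any threshold between the two would have to be calibrated on real data.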
