Beyond Surface-Level Similarity: Hierarchical Contamination Detection for Synthetic Training Data in Foundation Models

Sushant Mehta

arXiv:2511.17602·cs.LG·November 25, 2025

TL;DR

This paper introduces a hierarchical contamination detection framework for synthetic training data in foundation models. It identifies semantic-level overlap with benchmarks that token-based methods miss (F1 = 0.76 versus 0.17-0.49 for existing methods).

Contribution

The author proposes a four-level detection framework (token level, semantic level, reasoning pattern, and performance cliff detection) that captures semantic contamination in synthetic data which token-level methods miss.

Findings

1. Semantic-level contamination evades existing detection methods (F1 = 0.17-0.49).

2. The hierarchical approach achieves a higher detection F1 score (0.76).

3. Average improvement of 26.5% over state-of-the-art baselines.

Abstract

Synthetic data has become essential for training foundation models, yet benchmark contamination threatens evaluation integrity. Although existing detection methods identify token-level overlap, they fail to detect semantic-level contamination, where synthetic data conceptually resembles benchmarks without lexical overlap. This gap is critical as foundation models increasingly train on synthetic data that may implicitly encode benchmark knowledge. We propose a hierarchical contamination detection framework operating at four levels: token level, semantic level, reasoning pattern, and performance cliff detection. Through controlled experiments on MMLU, GSM8K, and HumanEval, we demonstrate that semantic-level contamination evades existing methods (F1 = 0.17-0.49) but is effectively detected by our hierarchical approach (F1 = 0.76), with an average improvement of 26.5% over state-of-the-art…
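To make the gap described in the abstract concrete, the following minimal sketch shows how a plain token-level check (word n-gram overlap) flags a verbatim copy of a benchmark item but gives a paraphrase a score of zero. It is an illustration only, not the paper's detector: the function names `ngrams` and `token_overlap`, the trigram size, the 0.5 threshold, and the example question are all assumptions introduced here.

```python
import re


def ngrams(text: str, n: int = 3) -> set:
    """Lowercase the text, keep alphanumeric tokens, return the set of word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def token_overlap(candidate: str, reference: str, n: int = 3) -> float:
    """Fraction of the reference's n-grams that also occur in the candidate."""
    ref = ngrams(reference, n)
    if not ref:
        return 0.0
    return len(ref & ngrams(candidate, n)) / len(ref)


# Hypothetical benchmark item and two hypothetical synthetic samples.
benchmark = "A train travels 60 miles in 2 hours. What is its average speed in miles per hour?"
verbatim = "Q: A train travels 60 miles in 2 hours. What is its average speed in miles per hour? A: 30"
paraphrase = "If a locomotive covers sixty miles over a span of two hours, how fast is it going on average?"

THRESHOLD = 0.5  # assumed cutoff for this illustration, not a value from the paper

for name, sample in [("verbatim", verbatim), ("paraphrase", paraphrase)]:
    score = token_overlap(sample, benchmark)
    print(f"{name}: overlap={score:.2f} flagged={score >= THRESHOLD}")
```

The paraphrase asks the same question yet shares no trigram with the benchmark item, so a purely lexical check passes it. This is the kind of case the abstract's semantic, reasoning-pattern, and performance-cliff levels are meant to cover; how those levels are implemented is not specified in this summary.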

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Machine Learning and Data Classification