Beyond Surface-Level Similarity: Hierarchical Contamination Detection for Synthetic Training Data in Foundation Models
Sushant Mehta

TL;DR
This paper introduces a hierarchical contamination detection framework for synthetic training data in foundation models. It identifies semantic-level overlap with benchmarks that current token-based methods miss, which improves data audit accuracy.
Contribution
The authors propose a novel multi-level detection framework that captures semantic contamination in synthetic data, surpassing existing token-level methods in accuracy.
Findings
Semantic contamination evades existing detection methods (F1 = 0.17-0.49).
The hierarchical approach detects it with a higher F1 score (0.76).
Average improvement of 26.5% over state-of-the-art baselines.
Abstract
Synthetic data has become essential for training foundation models, yet benchmark contamination threatens evaluation integrity. Although existing detection methods identify token-level overlap, they fail to detect semantic-level contamination, where synthetic data conceptually resemble benchmarks without lexical overlap. This gap is critical as foundation models increasingly train on synthetic data that may implicitly encode benchmark knowledge. We propose a hierarchical contamination detection framework operating at four levels: token level, semantic level, reasoning pattern, and performance cliff detection. Through controlled experiments on MMLU, GSM8K and HumanEval, we demonstrate that semantic-level contamination evades existing methods (F1 = 0.17-0.49) but is effectively detected by our hierarchical approach (F1 = 0.76), with an average improvement of 26.5% over state-of-the-art…
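To make the idea of a level-by-level check concrete, here is a minimal Python sketch of the first two levels named in the abstract (token level and semantic level). It is an illustration under stated assumptions, not the authors' implementation: the names `token_overlap`, `toy_embed`, `Level`, and `detect`, the trigram Jaccard score, the hand-written concept table standing in for a real sentence encoder, and both thresholds are all hypothetical. The reasoning-pattern and performance-cliff levels are left out because the abstract does not describe how they are computed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


def ngrams(text: str, n: int = 3) -> set:
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def token_overlap(sample: str, benchmark: str) -> float:
    """Token level (assumed metric): Jaccard similarity of word trigrams."""
    a, b = ngrams(sample), ngrams(benchmark)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(y * y for y in v) ** 0.5
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical stand-in for a sentence encoder: words that share a meaning
# map to the same dimension. A real system would use learned embeddings.
CONCEPTS = {"has": 0, "owns": 0, "buys": 1, "purchases": 1,
            "more": 2, "extra": 2, "many": 3, "total": 3}


def toy_embed(text: str) -> List[float]:
    """Count how often each concept dimension appears in text."""
    vec = [0.0] * 4
    for tok in text.lower().split():
        if tok in CONCEPTS:
            vec[CONCEPTS[tok]] += 1.0
    return vec


def semantic_similarity(sample: str, benchmark: str) -> float:
    """Semantic level (assumed metric): cosine similarity of embeddings."""
    return cosine(toy_embed(sample), toy_embed(benchmark))


@dataclass
class Level:
    name: str
    score: Callable[[str, str], float]
    threshold: float  # illustrative value, not taken from the paper


LEVELS = [
    Level("token", token_overlap, 0.5),
    Level("semantic", semantic_similarity, 0.8),
]


def detect(sample: str, benchmark: str,
           levels: Sequence[Level] = LEVELS) -> Optional[str]:
    """Run levels in order; return the first level that flags the pair."""
    for level in levels:
        if level.score(sample, benchmark) >= level.threshold:
            return level.name
    return None


benchmark = "Tom has 3 apples and buys 2 more how many apples"
paraphrase = "Tom owns 3 apples then purchases 2 extra what is the total"

print(detect(benchmark, benchmark))                         # token
print(detect(paraphrase, benchmark))                        # semantic
print(detect("The capital of France is Paris", benchmark))  # None
```

The paraphrase shares no trigram with the benchmark item, so the token-level check scores it 0.0 and only the semantic-level check flags it. This is the kind of case the abstract says token-based methods miss.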
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Machine Learning and Data Classification
