From Data Leak to Secret Misses: The Impact of Data Leakage on Secret Detection Models
Farnaz Soltaniani, Mohammad Ghafari

TL;DR
This paper investigates how data leakage caused by duplicated samples inflates the reported performance of secret detection models, and highlights the need for duplicate-aware dataset management so that evaluations stay realistic.
Contribution
It reveals the extent of data leakage in a popular secret detection dataset and demonstrates its impact on model performance evaluation.
Findings
Data leakage significantly inflates performance metrics.
Duplicated samples are common in benchmark datasets.
Proper dataset splitting reduces inflated performance estimates.
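One way to read "proper dataset splitting" is a split that keeps every copy of a duplicated sample on the same side of the train/test boundary. The sketch below is only an illustration of that idea, not the authors' procedure: the `normalize` rule (collapsing whitespace) and the `group_split` helper are hypothetical names, and exact matching after normalization is an assumed, deliberately simple definition of "duplicate".

```python
import hashlib
import random
from collections import defaultdict


def normalize(sample: str) -> str:
    # Assumed normalization: collapse runs of whitespace so trivially
    # reformatted copies of the same line hash to the same key.
    return " ".join(sample.split())


def group_split(samples, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with duplicates kept together."""
    # Bucket sample indices by the hash of their normalized text.
    groups = defaultdict(list)
    for idx, sample in enumerate(samples):
        key = hashlib.sha256(normalize(sample).encode("utf-8")).hexdigest()
        groups[key].append(idx)

    # Shuffle whole groups, not individual samples.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test_groups = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test_groups])

    train, test = [], []
    for key in keys:
        (test if key in test_keys else train).extend(groups[key])
    return sorted(train), sorted(test)


samples = [
    'password = "abc123"',
    'password  =  "abc123"',   # whitespace variant of the first sample
    'api_key = "XYZ"',
    'token = "t0k3n"',
    'api_key = "XYZ"',         # exact duplicate
    'x = 1',
]
train_idx, test_idx = group_split(samples, test_fraction=0.25)
```

Because the fraction is applied to groups rather than samples, the resulting test set can be larger or smaller than `test_fraction` of the data when duplicate groups are uneven in size.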
Abstract
Machine learning models are increasingly used for software security tasks. These models are commonly trained and evaluated on large Internet-derived datasets, which often contain duplicated or highly similar samples. When such samples are split across training and test sets, data leakage may occur, allowing models to memorize patterns instead of learning to generalize. We investigate duplication in a widely used benchmark dataset of hard-coded secrets and show how data leakage can substantially inflate the reported performance of AI-based secret detectors, resulting in a misleading picture of their real-world effectiveness.
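To make the leakage mechanism concrete, the following sketch estimates what fraction of a test set has a duplicate or near-duplicate in the training set. It is a rough illustration under stated assumptions, not the measurement used in the paper: similarity is taken to be Jaccard overlap of lowercase word tokens, the 0.8 threshold is arbitrary, and `tokens`, `jaccard`, and `leaked_fraction` are hypothetical helper names.

```python
import re


def tokens(sample: str) -> set:
    # Assumed tokenization: lowercase alphanumeric/underscore runs.
    return set(re.findall(r"\w+", sample.lower()))


def jaccard(a: set, b: set) -> float:
    # Jaccard similarity; two empty token sets are treated as identical.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leaked_fraction(train, test, threshold=0.8):
    """Fraction of test samples with a similar sample in the training set."""
    if not test:
        return 0.0
    train_tokens = [tokens(s) for s in train]
    leaked = 0
    for sample in test:
        t = tokens(sample)
        # A test sample counts as leaked if any training sample is at
        # least `threshold` similar to it.
        if any(jaccard(t, u) >= threshold for u in train_tokens):
            leaked += 1
    return leaked / len(test)


train = ['password = "hunter2"', 'api_key = "abc"']
test = ['PASSWORD = "hunter2"', 'db_host = "localhost"']
rate = leaked_fraction(train, test)  # 0.5: the first test sample matches
```

The pairwise comparison is quadratic in dataset size, so for a large benchmark an approximate method such as MinHash would likely be needed; that choice is outside what this summary specifies.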
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing
