From Data Leak to Secret Misses: The Impact of Data Leakage on Secret Detection Models

Farnaz Soltaniani; Mohammad Ghafari

arXiv:2601.22946 · cs.CR · February 2, 2026

TL;DR

This paper investigates how data leakage due to duplicated samples in datasets inflates the performance of secret detection models, highlighting the need for better dataset management to ensure realistic evaluation.

Contribution

The paper reveals the extent of data leakage in a popular secret detection dataset and demonstrates how that leakage distorts model performance evaluation.

Findings

01

Data leakage significantly inflates performance metrics.

02

Duplicated samples are common in benchmark datasets.

03

Proper dataset splitting reduces inflated performance estimates.
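
One way to read "proper dataset splitting" is a group-aware split, in which every copy of a duplicated sample lands on the same side of the train/test boundary. The sketch below is an illustration only, not the authors' procedure: the helper names `group_key` and `group_aware_split` are hypothetical, and the whitespace-collapsing normalization is an assumption that catches exact and whitespace-only duplicates but not the "highly similar" samples the abstract also mentions.

```python
import hashlib
import random
from collections import defaultdict


def group_key(sample: str) -> str:
    """Hash of a whitespace-normalized sample (assumed duplicate criterion)."""
    normalized = " ".join(sample.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_aware_split(samples, test_fraction=0.2, seed=0):
    """Split samples so that all duplicates of a sample share one side."""
    # Bucket samples by their duplicate key.
    groups = defaultdict(list)
    for sample in samples:
        groups[group_key(sample)].append(sample)

    # Shuffle the groups (not the individual samples) reproducibly.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    # Fill the test set group by group until it reaches the target size;
    # every remaining group goes to the training set.
    target = test_fraction * len(samples)
    train, test = [], []
    for key in keys:
        bucket = test if len(test) < target else train
        bucket.extend(groups[key])
    return train, test
```

Because whole groups are assigned at once, the test set can end up slightly larger than `test_fraction` suggests; that trade-off is accepted here in exchange for zero duplicate overlap between the two sets.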

Abstract

Machine learning models are increasingly used for software security tasks. These models are commonly trained and evaluated on large Internet-derived datasets, which often contain duplicated or highly similar samples. When such samples are split across training and test sets, data leakage may occur, allowing models to memorize patterns instead of learning to generalize. We investigate duplication in a widely used benchmark dataset of hard-coded secrets and show how data leakage can substantially inflate the reported performance of AI-based secret detectors, resulting in a misleading picture of their real-world effectiveness.
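
To make the leakage described in the abstract concrete, the following minimal sketch estimates what fraction of a test set also appears in the training set. It is not taken from the paper: the function name `leakage_rate` is hypothetical, and it only counts matches after collapsing whitespace, so near-duplicates that differ in any other way would go undetected.

```python
def leakage_rate(train, test):
    """Fraction of test samples whose whitespace-normalized text is in train."""
    if not test:
        return 0.0
    # Assumed duplicate criterion: identical after collapsing whitespace.
    train_seen = {" ".join(sample.split()) for sample in train}
    leaked = sum(1 for sample in test if " ".join(sample.split()) in train_seen)
    return leaked / len(test)


# Toy data invented for illustration; not drawn from any real benchmark.
train = ['api_key = "abc123"', 'password = "hunter2"']
test = ['api_key  =  "abc123"', 'token = "xyz"']
print(leakage_rate(train, test))  # 0.5
```

A rate well above zero on a random split would suggest that reported test scores partly reflect memorized training samples rather than generalization.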


Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing