Impact of Data Snooping on Deep Learning Models for Locating Vulnerabilities in Lifted Code

Gary A. McCully, John D. Hastings, and Shengjie Xu

arXiv:2412.02048 · cs.CR · December 29, 2025

Open Access

TL;DR

This paper investigates how data snooping influences deep learning models for vulnerability detection in lifted code, finding minimal impact and highlighting GPT-2 embeddings' robustness.

Contribution

It demonstrates that data snooping has little effect on model performance and confirms GPT-2 embeddings' superiority in representing complex code features.

Findings

01. Data snooping did not significantly change model performance.

02. GPT-2 embeddings consistently outperform other embeddings.

03. Models remain robust even with data snooping introduced.

Abstract

This study examines the impact of data snooping on neural networks used to detect vulnerabilities in lifted code and builds on previous research that used word2vec and unidirectional and bidirectional transformer-based embeddings. The research specifically focuses on how model performance is affected when embedding models are trained on datasets that include samples used for neural network training and validation. The results show that introducing data snooping did not significantly alter model performance, suggesting either that data snooping had a minimal impact or that samples randomly dropped as part of the methodology contained hidden features critical to achieving optimal performance. In addition, the findings reinforce the conclusions of previous research, which found that models trained with GPT-2 embeddings consistently outperformed neural networks trained with other embeddings.…
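The data-snooping condition described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names (`build_splits`, `overlap`), the split fractions, and the `func_N` sample identifiers are all hypothetical, chosen only to show the difference between an embedding corpus that is disjoint from the classifier's training and validation samples and one that includes them.

```python
import random


def build_splits(samples, snoop, embed_frac=0.5, val_frac=0.2, seed=0):
    """Return (embedding_corpus, train, val) as lists of sample ids.

    Hypothetical helper, not taken from the paper.
    snoop=False: the embedding corpus is disjoint from train/val.
    snoop=True:  the embedding corpus also contains the train/val samples.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)

    # Reserve one portion for embedding training only.
    n_embed = int(len(shuffled) * embed_frac)
    held_out, classifier = shuffled[:n_embed], shuffled[n_embed:]

    # Split the remainder into classifier validation and training sets.
    n_val = int(len(classifier) * val_frac)
    val, train = classifier[:n_val], classifier[n_val:]

    # Snooping: let the embedding model see the classifier's samples too.
    embedding_corpus = held_out + classifier if snoop else held_out
    return embedding_corpus, train, val


def overlap(embedding_corpus, train, val):
    """Count samples shared by the embedding corpus and train/val sets."""
    return len(set(embedding_corpus) & (set(train) | set(val)))


samples = [f"func_{i}" for i in range(100)]  # assumed placeholder ids
print(overlap(*build_splits(samples, snoop=False)))  # 0
print(overlap(*build_splits(samples, snoop=True)))   # 50
```

A nonzero overlap is the condition the study introduces deliberately; comparing classifier performance across the two settings is what isolates the effect of snooping.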

Taxonomy

Topics: Software Reliability and Analysis Research · Advanced Malware Detection Techniques · Software Engineering Research

Methods: Dense Connections · Layer Normalization · Linear Layer · Discriminative Fine-Tuning · Weight Decay · Attention Dropout · Residual Connection · Adam · Attention Is All You Need