Impact of Data Snooping on Deep Learning Models for Locating Vulnerabilities in Lifted Code
Gary A. McCully, John D. Hastings, and Shengjie Xu

TL;DR
This paper investigates how data snooping influences deep learning models for vulnerability detection in lifted code, finding minimal impact and highlighting GPT-2 embeddings' robustness.
Contribution
It shows that introducing data snooping did not significantly alter model performance and reinforces earlier findings that models trained with GPT-2 embeddings outperform those trained with other embeddings.
Findings
Data snooping did not significantly change model performance.
GPT-2 embeddings consistently outperform other embeddings.
Models remain robust even with data snooping introduced.
Abstract
This study examines the impact of data snooping on neural networks used to detect vulnerabilities in lifted code, building on previous research that used word2vec and unidirectional and bidirectional transformer-based embeddings. The research focuses on how model performance is affected when embedding models are trained on datasets that include samples used for neural network training and validation. The results show that introducing data snooping did not significantly alter model performance, suggesting either that data snooping had a minimal impact or that samples randomly dropped as part of the methodology contained hidden features critical to achieving optimal performance. In addition, the findings reinforce the conclusions of previous research, which found that models trained with GPT-2 embeddings consistently outperformed neural networks trained with other embeddings.…
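The data-snooping setup described in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the function names `split_samples` and `overlap`, the split fractions, and the use of integer sample ids are all assumptions made for illustration. It only shows the difference between an embedding corpus that is disjoint from the classifier's training and validation samples and one that includes them.

```python
import random


def split_samples(sample_ids, snoop, embed_frac=0.5, val_frac=0.2, seed=0):
    """Return (embedding_corpus, train, val) lists of sample ids.

    snoop=False: the embedding corpus is disjoint from train/val.
    snoop=True:  the embedding corpus also contains the train/val samples.
    The fractions are hypothetical, not taken from the paper.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)

    # Reserve one part for embedding training only, the rest for the classifier.
    n_embed = int(len(ids) * embed_frac)
    embed_only, classifier = ids[:n_embed], ids[n_embed:]

    # Split the classifier portion into validation and training sets.
    n_val = int(len(classifier) * val_frac)
    val, train = classifier[:n_val], classifier[n_val:]

    # With snooping, the embedding model also sees the classifier's samples.
    corpus = embed_only + classifier if snoop else embed_only
    return corpus, train, val


def overlap(corpus, train, val):
    """Count samples shared by the embedding corpus and train/val."""
    return len(set(corpus) & (set(train) | set(val)))


if __name__ == "__main__":
    for snoop in (False, True):
        corpus, train, val = split_samples(range(100), snoop)
        print(f"snoop={snoop}: corpus={len(corpus)}, "
              f"train={len(train)}, val={len(val)}, "
              f"overlap={overlap(corpus, train, val)}")
```

A nonzero overlap marks the snooping condition. Comparing classifier results across the two conditions is the kind of experiment the abstract describes, though the actual dataset handling in the paper may differ.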
Taxonomy
Topics: Software Reliability and Analysis Research · Advanced Malware Detection Techniques · Software Engineering Research
Methods: Dense Connections · Layer Normalization · Linear Layer · Discriminative Fine-Tuning · Weight Decay · Attention Dropout · Residual Connection · Adam · Attention Is All You Need
