Enhancing Pre-Trained Language Models for Vulnerability Detection via Semantic-Preserving Data Augmentation

Weiliang Qi; Jiahao Cao; Darsh Poddar; Sophia Li; Xinda Wang

arXiv:2410.00249·cs.CR·October 4, 2024

TL;DR

This paper introduces a semantic-preserving data augmentation method for vulnerability detection with pre-trained language models. By generating diverse, realistic samples that retain vulnerability semantics, it raises accuracy by up to 10.1% and F1 score by up to 23.6%.

Contribution

The paper presents a novel natural program transformation technique for data augmentation that enhances vulnerability detection models trained on limited datasets.

Findings

01. Up to 10.1% increase in accuracy
02. Up to 23.6% increase in F1 score
03. Outperforms existing augmentation methods

Abstract

With the rapid development and widespread use of advanced network systems, software vulnerabilities pose a significant threat to secure communications and networking. Learning-based vulnerability detection systems, particularly those leveraging pre-trained language models, have demonstrated significant potential in promptly identifying vulnerabilities in communication networks and reducing the risk of exploitation. However, the shortage of accurately labeled vulnerability datasets hinders further progress in this field. Failing to represent real-world vulnerability data variety and preserve vulnerability semantics, existing augmentation approaches provide limited or even counterproductive contributions to model training. In this paper, we propose a data augmentation technique aimed at enhancing the performance of pre-trained language models for vulnerability detection. Given the…

Taxonomy

Topics: Software System Performance and Reliability