Coding-PTMs: How to Find Optimal Code Pre-trained Models for Code Embedding in Vulnerability Detection?

Yu Zhao; Lina Gong; Zhiqiu Huang; Yongwei Wang; Mingqiang Wei; Fei Wu

arXiv:2408.04863·cs.SE·August 12, 2024

TL;DR

This paper investigates how different code pre-trained models affect vulnerability detection performance and proposes a framework to recommend the best PTMs based on embedding characteristics.

Contribution

It systematically analyzes the impact of ten code PTMs on vulnerability detection and introduces Coding-PTMs, a framework for selecting optimal models using embedding metrics and machine learning.

Findings

01. Code embeddings from different PTMs significantly influence detection performance.

02. Parameter scale and embedding dimension are unreliable indicators for PTM selection.

03. The proposed framework effectively recommends optimal code PTMs for vulnerability detection.

Abstract

Vulnerability detection is garnering increasing attention in software engineering, since code vulnerabilities can pose significant security risks. Recently, reusing various code pre-trained models (PTMs) for code embedding has become common in vulnerability detection, often without reasonable justification. The premise for casually utilizing PTMs is that the code embeddings generated by different PTMs have a similar impact on performance. Is that TRUE? To answer this important question, we systematically investigate the effects of code embeddings generated by ten different code PTMs on the performance of vulnerability detection, and find that it is NOT true. We observe that code embeddings generated by various code PTMs do influence performance, and selecting an embedding technique based on parameter scale and embedding dimension is…
