CLNX: Bridging Code and Natural Language for C/C++ Vulnerability-Contributing Commits Identification

Zeqing Qin; Yiwei Wu; Lansheng Han

arXiv:2409.07407·cs.CR·September 12, 2024

TL;DR

This paper introduces CLNX, a lightweight method that improves BERT-based models' ability to identify C/C++ vulnerability-contributing commits by converting code into a more natural form, achieving state-of-the-art results.

Contribution

The paper presents CLNX, a lightweight bridge that improves BERT-based LLMs' identification of C/C++ vulnerability-contributing commits through structure and token naturalization, avoiding the resource-intensive further pre-training that current approaches rely on.

Findings

1. CLNX significantly improves LLM performance on C/C++ VCC detection.
2. CLNX-equipped CodeBERT achieves state-of-the-art results.
3. CLNX identified 38 real-world OSS vulnerabilities.

Abstract

Large Language Models (LLMs) have shown great promise in vulnerability identification. As C/C++ comprises half of the Open-Source Software (OSS) vulnerabilities over the past decade and updates in OSS mainly occur through commits, enhancing LLMs' ability to identify C/C++ Vulnerability-Contributing Commits (VCCs) is essential. However, current studies primarily focus on further pre-training LLMs on massive code datasets, which is resource-intensive and poses efficiency challenges. In this paper, we enhance the ability of BERT-based LLMs to identify C/C++ VCCs in a lightweight manner. We propose CodeLinguaNexus (CLNX) as a bridge facilitating communication between C/C++ programs and LLMs. Based on commits, CLNX efficiently converts the source code into a more natural representation while preserving key details. Specifically, CLNX first applies structure-level naturalization to decompose…

Taxonomy

Topics: Software Reliability and Analysis Research · Security and Verification in Computing · Advanced Data Processing Techniques

Methods: CodeBERT · Focus