A Comparison of Different Source Code Representation Methods for Vulnerability Prediction in Python

Amirreza Bagheri; Péter Hegedűs

arXiv:2108.02044·cs.SE·August 5, 2021

TL;DR

This study compares word2vec, fastText, and BERT as ways of representing Python source code for vulnerability prediction, and finds BERT to be both the most accurate (93.8% accuracy) and the least time-consuming of the three.

Contribution

The paper provides a systematic comparison of text embedding techniques for vulnerability detection in Python source code, highlighting BERT's superior performance.

Findings

1. BERT achieved the highest accuracy of 93.8%.

2. All methods are suitable for code representation in this task.

3. BERT was the least time-consuming among the methods.

Abstract

In the age of big data and machine learning, at a time when the techniques and methods of software development are evolving rapidly, a problem has arisen: programmers can no longer detect all the security flaws and vulnerabilities in their code manually. To overcome this problem, developers can now rely on automatic techniques, like machine learning based prediction models, to detect such issues. An inherent property of such approaches is that they work with numeric vectors (i.e., feature vectors) as inputs. Therefore, one needs to transform the source code into such feature vectors, often referred to as code embedding. A popular approach for code embedding is to adapt natural language processing techniques, like text representation, to automatically derive the necessary features from the source code. However, the suitability and comparison of different text representation techniques…
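
As a rough illustration of what "transforming the source code into feature vectors" can mean, the sketch below tokenizes Python snippets with the standard library and turns each one into a token-count vector. It is a deliberately simple bag-of-tokens stand-in, not the word2vec, fastText, or BERT pipelines the paper compares, and the function names (`code_tokens`, `build_vocab`, `to_feature_vector`) are hypothetical names chosen for this example.

```python
import io
import tokenize
from collections import Counter

# Token types that carry layout rather than content; dropped in this sketch.
_SKIP = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def code_tokens(source: str) -> list[str]:
    """Split Python source into lexical tokens, ignoring layout-only tokens."""
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in _SKIP
    ]


def build_vocab(snippets: list[str]) -> dict[str, int]:
    """Assign each distinct token an index, in order of first appearance."""
    vocab: dict[str, int] = {}
    for src in snippets:
        for token in code_tokens(src):
            vocab.setdefault(token, len(vocab))
    return vocab


def to_feature_vector(source: str, vocab: dict[str, int]) -> list[int]:
    """Count how often each vocabulary token occurs in the snippet."""
    counts = Counter(code_tokens(source))
    vector = [0] * len(vocab)
    for token, count in counts.items():
        if token in vocab:  # tokens unseen at vocabulary-build time are ignored
            vector[vocab[token]] = count
    return vector


if __name__ == "__main__":
    snippets = ["x = eval(user_input)\n", "x = int(user_input)\n"]
    vocab = build_vocab(snippets)
    for src in snippets:
        print(to_feature_vector(src, vocab))
```

Count vectors like these discard token order and meaning, which is the gap that learned embeddings such as word2vec, fastText, and BERT are meant to close; the numeric vector is what a downstream prediction model would consume.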
