Exploiting Token and Path-based Representations of Code for Identifying Security-Relevant Commits
Achyudh Ram, Ji Xin, Meiyappan Nagappan, Yaoliang Yu, Rocío Cabrera Lozoya, Antonino Sabetta, Jimmy Lin

TL;DR
This paper introduces hierarchical deep learning models that identify security-relevant commits from either the commit diff or the Java source code. The models outperform code2vec and a logistic regression baseline, which supports more timely detection of security threats.
Contribution
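The paper presents novel hierarchical deep learning models for identifying security-relevant commits, compares them against code2vec (a state-of-the-art model that learns from path-based representations of code) and a logistic regression baseline, and analyzes how the choice of input representation affects performance.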
Findings
Deep learning models outperform code2vec and logistic regression in identifying security-relevant commits.
Regularization improves model generalization across different input representations.
Analysis reveals how various models learn from code and diff inputs.
Abstract
Public vulnerability databases such as CVE and NVD account for only 60% of security vulnerabilities present in open-source projects, and are known to suffer from inconsistent quality. Over the last two years, there has been considerable growth in the number of known vulnerabilities across projects available in various repositories such as NPM and Maven Central. Such an increasing risk calls for a mechanism to infer the presence of security threats in a timely manner. We propose novel hierarchical deep learning models for the identification of security-relevant commits from either the commit diff or the source code for the Java classes. By comparing the performance of our model against code2vec, a state-of-the-art model that learns from path-based representations of code, and a logistic regression baseline, we show that deep learning models show promising results in identifying…
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software Reliability and Analysis Research
Methods: Logistic Regression
