Lightweight Vulnerability Detection from Code Metrics and Token Features
Chun Yin Chiu

TL;DR
This paper presents a lightweight, interpretable vulnerability detection method for C/C++ code using token n-grams and basic code metrics, avoiding complex deep learning models.
Contribution
It introduces a simple, fast vulnerability triage pipeline combining TF-IDF token features with code metrics, evaluated on Devign labels across various settings.
Findings
The best variant achieves PR-AUC 0.642 and Recall@10% 0.161 on the random split.
Cross-project generalization is more challenging, with PR-AUC around 0.436.
Simple token and metric features serve as a transparent baseline but are sensitive to superficial cues.
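For readers unfamiliar with the two ranking metrics reported above, the following is a minimal sketch of how they could be computed from labels and classifier scores. It is not the paper's evaluation code. The function names `recall_at_fraction` and `average_precision` are hypothetical, PR-AUC is approximated here by average precision, and the review budget is assumed to be the top 10% of functions by score, rounded to the nearest whole item.

```python
def _ranked_labels(labels, scores):
    """Return labels ordered by descending score (ties keep input order)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def recall_at_fraction(labels, scores, fraction=0.10):
    """Share of all vulnerable functions found in the top `fraction` of the ranking."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Assumption: the budget is rounded to the nearest item, with a minimum of one.
    k = max(1, round(fraction * len(labels)))
    hits = sum(_ranked_labels(labels, scores)[:k])
    return hits / total_pos


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    hits = 0
    ap = 0.0
    for rank, label in enumerate(_ranked_labels(labels, scores), start=1):
        if label:
            hits += 1
            ap += hits / rank  # precision at each rank where a positive appears
    return ap / total_pos
```

Under this definition, a Recall@10% of 0.161 would mean that reviewing the top-scored tenth of functions surfaces about 16% of the labeled vulnerable ones.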
Abstract
Vulnerability detection for C/C++ code increasingly relies on heavy representations such as code graphs and deep models, while many practical workflows still benefit from fast and reproducible ranking baselines for human triage. This preprint studies a lightweight function-level vulnerability triage pipeline that combines sparse token n-grams from raw function text with a small set of inexpensive code metrics, including NLOC, approximate cyclomatic complexity, token count, maximum brace depth, and parameter count. We use TF-IDF token features and a class-weighted logistic regression classifier, avoiding deep learning, transformers, and program graphs. Using the Devign function-level labels, we evaluate random and cross-project settings, including an FFmpeg-to-QEMU transfer experiment. We emphasize precision-recall AUC and Recall@10% as ranking-oriented metrics for skewed or…
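The five code metrics named in the abstract can be approximated directly from raw function text. The sketch below is one plausible way to do so, not the paper's implementation. The helper name `code_metrics`, the set of keywords counted as decision points, and the tokenization rule are all assumptions. Comments and string literals are not stripped, and the parameter count reads only the first parenthesized group, so function-pointer parameters would be miscounted.

```python
import re

# Assumed set of constructs that each add one decision point.
_DECISION = re.compile(r"\b(?:if|for|while|case)\b|&&|\|\||\?")
# Assumed tokenization: identifiers, integer runs, or any single non-space character.
_TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
# First parenthesized group with no nested parentheses, taken as the parameter list.
_PARAMS = re.compile(r"\(([^()]*)\)")


def code_metrics(func_text):
    """Return rough per-function metrics computed from raw C/C++ source text."""
    # NLOC approximated as the number of non-blank lines.
    nloc = sum(1 for line in func_text.splitlines() if line.strip())

    # Approximate cyclomatic complexity: one plus the number of decision points.
    complexity = 1 + len(_DECISION.findall(func_text))

    # Maximum brace depth via a running counter over the characters.
    depth = max_depth = 0
    for ch in func_text:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth = max(0, depth - 1)

    # Parameter count from the first parenthesized group.
    match = _PARAMS.search(func_text)
    params = match.group(1).strip() if match else ""
    param_count = 0 if params in ("", "void") else params.count(",") + 1

    return {
        "nloc": nloc,
        "cyclomatic": complexity,
        "token_count": len(_TOKEN.findall(func_text)),
        "max_brace_depth": max_depth,
        "param_count": param_count,
    }
```

In a pipeline like the one described, a dictionary of this kind would be turned into a small numeric vector and concatenated with the sparse TF-IDF token features before fitting the classifier.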
