Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses?

Yikun Li; Ngoc Tan Bui; Ting Zhang; Chengran Yang; Xin Zhou; Martin Weyssow; Jinfeng Jiang; Junkai Chen; Huihui Huang; Huu Hung Nguyen; Chiok Yew Ho; Jie Tan; Ruiyin Li; Yide Yin; Han Wei Ang; Frank Liauw; Eng Lieh Ouh; Lwin Khin Shar; David Lo

arXiv:2507.21817·cs.CR·January 16, 2026

TL;DR

This paper introduces new datasets and pipelines to improve the evaluation and training of vulnerability detection models, highlighting the gap between in-distribution and out-of-distribution performance and proposing solutions for more realistic assessment.

Contribution

The authors present the BenchVul test dataset, the TitanVul training dataset, and the RVG pipeline to improve dataset quality and model evaluation for vulnerability detection, addressing limitations in the label accuracy, CWE coverage, and generalization of existing datasets.

Findings

01

Models trained on TitanVul outperform those trained on BigVul in out-of-distribution (OOD) tests.

02

Augmenting TitanVul with RVG improves real-world vulnerability detection accuracy.

03

In-distribution (ID) accuracy does not reliably predict OOD performance in vulnerability detection.
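
The gap between in-distribution (ID) and out-of-distribution (OOD) accuracy described in these findings can be made concrete with a small sketch. The code below is an illustration only, not the authors' evaluation code: the helper names `accuracy` and `generalization_gap` are hypothetical, and the label lists are invented toy data, not results from the paper. It assumes binary labels (1 = vulnerable, 0 = not vulnerable) and measures ID accuracy on a held-out split of the training dataset and OOD accuracy on a separate benchmark.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def generalization_gap(
    id_true: Sequence[int],
    id_pred: Sequence[int],
    ood_true: Sequence[int],
    ood_pred: Sequence[int],
) -> float:
    """ID accuracy minus OOD accuracy (hypothetical helper).

    A large positive value means the ID score overstates how well
    the model performs on data from a different source.
    """
    return accuracy(id_true, id_pred) - accuracy(ood_true, ood_pred)


# Toy labels invented for illustration (1 = vulnerable, 0 = not vulnerable).
id_true = [1, 0, 1, 0, 1, 0, 1, 0]
id_pred = [1, 0, 1, 0, 1, 0, 1, 1]    # held-out split of the training dataset
ood_true = [1, 0, 1, 0, 1, 0, 1, 0]
ood_pred = [1, 1, 0, 0, 1, 1, 0, 0]   # separate benchmark from another source

id_acc = accuracy(id_true, id_pred)
ood_acc = accuracy(ood_true, ood_pred)
gap = generalization_gap(id_true, id_pred, ood_true, ood_pred)
print(f"ID accuracy:  {id_acc:.3f}")
print(f"OOD accuracy: {ood_acc:.3f}")
print(f"Gap:          {gap:.3f}")
```

Reporting both numbers side by side, rather than ID accuracy alone, is what makes the gap visible; a model can score well on its own test split and still do little better than chance on an independent benchmark.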

Abstract

Automated vulnerability detection research has made substantial progress, yet its real-world impact remains limited. Prior work found that current vulnerability datasets suffer from label inaccuracy rates of 20%-71%, extensive duplication, and poor coverage of critical Common Weakness Enumeration (CWE) types. These issues create a significant generalization gap: models achieve misleading In-Distribution (ID) accuracies (measured on test splits from the same dataset) by exploiting spurious correlations rather than learning true vulnerability patterns. To address these limitations, we present a three-part solution. First, we introduce BenchVul, a manually curated and balanced test dataset covering the MITRE Top 25 Most Dangerous CWEs, to enable fair model evaluation. Second, we construct a high-quality training dataset, TitanVul, comprising 38,548 functions by…
