SecVulEval: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection
Md Basim Uddin Ahmed, Nima Shiri Harzevili, Jiho Shin, Hung Viet Pham, Song Wang

TL;DR
This paper introduces SecVulEval, a comprehensive benchmark for evaluating large language models on fine-grained, context-rich C/C++ vulnerability detection at the statement level, addressing limitations of previous datasets.
Contribution
The paper presents SecVulEval, a new benchmark with detailed contextual information and fine-grained labels, enabling more accurate assessment of LLMs in real-world vulnerability detection.
Findings
State-of-the-art LLMs perform poorly on the benchmark; the best model achieves an F1-score of only 23.83%.
Rich contextual information improves the evaluation of vulnerability detection models.
Analysis reveals current models lack accurate reasoning in identifying vulnerabilities.
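To make the F1 figure above concrete, here is a minimal sketch of how a statement-level F1-score could be computed for one function. It assumes predictions and ground truth are given as sets of vulnerable line numbers; the function name `statement_f1` and the sample line numbers are hypothetical, and the paper's exact metric definition may differ.

```python
from typing import Iterable, Set, Tuple


def statement_f1(predicted: Iterable[int], actual: Iterable[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over sets of vulnerable line numbers.

    Assumption: a prediction counts as correct only if the exact line
    number appears in the ground-truth set.
    """
    pred: Set[int] = set(predicted)
    gold: Set[int] = set(actual)

    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0

    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the model flags lines 12, 13, and 20,
# while the ground truth marks lines 13 and 14 as vulnerable.
precision, recall, f1 = statement_f1(predicted=[12, 13, 20], actual=[13, 14])
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, a model that flags many non-vulnerable statements or misses most vulnerable ones scores low even when it labels the enclosing function correctly.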
Abstract
Large Language Models (LLMs) have shown promise in software engineering tasks, but evaluating their effectiveness in vulnerability detection is challenging due to the lack of high-quality datasets. Most existing datasets are limited to function-level labels, ignoring finer-grained vulnerability patterns and crucial contextual information. Also, poor data quality such as mislabeling, inconsistent annotations, and duplicates can lead to inflated performance and weak generalization. Moreover, by including only the functions, these datasets miss broader program context, like data/control dependencies and interprocedural interactions, that are essential for accurately understanding real-world security flaws. Without this context, detection models are evaluated under unrealistic assumptions. To address these limitations, this paper introduces SecVulEval, a benchmark designed to support…
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
The paper tackles a genuine problem in vulnerability detection benchmarking. The statement-level granularity is a meaningful improvement over function-level labels, and the inclusion of contextual information (validated at 83.16% accuracy) adds practical value. The rigorous deduplication process and filtering pipeline demonstrate careful data curation. The dataset's scale (707 projects, 145 CWE types) and temporal span provide good diversity.
The core contribution is essentially a dataset with better labels, which feels incremental rather than transformative. The 83.16% context extraction accuracy means ~17% of the dataset contains noisy annotations, yet no analysis quantifies how this affects downstream evaluation reliability. The multi-agent pipeline, while interesting, conflates dataset contribution with methodological innovation—it's unclear whether improvements come from the data or the approach. The paper lacks critical analysi
- The paper is clear and well-written.
- The paper moves beyond function-level labels to statement-level.
- The paper proposes a novel idea of collecting vulnerability context, such as variable state returned from an external function, function arguments, execution environment, etc., which are essential for detecting and understanding vulnerabilities.
- The paper proposes a multi-agent pipeline to detect vulnerabilities.
- No comparison against classical/static analyzers or existing deep-learning-based SOTA approaches on the same tasks/splits, which would ground LLM performance against established tools.
- Does the multi-agent system have an advantage for function-level vulnerability detection? A comparison with base LLMs would be interesting.
- GPT-4.1 is used to create the “required context” annotations, then later models are evaluated using these contexts. Even with a 1k sample audit (~83% accuracy), this intr
- The authors provide a fair amount of evidence to highlight the novelty of the dataset (de-duplication, accurate contextual information, statement-level labels).
- Experiment 2 (context identification) is interesting and partly highlights the reason why SoTA LLMs are failing in the vulnerability detection task.
- While the novelty of the dataset is shown, it is not clear why the proposed dataset is valuable for the end goal of vulnerability detection. The paper would have benefited from an experimental comparison to existing similar datasets. For example, how does pretraining or fine-tuning a model on your dataset (compared to other datasets) improve its performance?
- The paper only focuses on LLM-based vulnerability detection solutions, and lacks comparison with static analyzers and deep learn
Taxonomy
Topics: Software System Performance and Reliability · Network Security and Intrusion Detection · Cloud Computing and Resource Management
