Evaluating and Enhancing the Vulnerability Reasoning Capabilities of Large Language Models

Li Lu; Yanjie Zhao; Hongzhou Rao; Kechi Zhang; Haoyu Wang

arXiv:2602.06687·cs.CR·February 9, 2026

TL;DR

This paper introduces a new benchmark and a DAG-based reasoning framework for improving the logical consistency of large language models in vulnerability detection, significantly enhancing their reasoning accuracy.

Contribution

The paper presents a DAG-based reasoning model and a reinforcement learning approach for improving LLMs' vulnerability reasoning, along with a benchmark for evaluating reasoning robustness.

Findings

01. Models struggle with logical consistency in vulnerability reasoning.

02. DAGVul improves reasoning F1-score by 18.9%.

03. Our approach outperforms comparable models and is competitive with state-of-the-art systems.

Abstract

Large Language Models (LLMs) have demonstrated remarkable proficiency in vulnerability detection. However, a critical reliability gap persists: models frequently yield correct detection verdicts based on hallucinated logic or superficial patterns that deviate from the actual root cause. This misalignment remains largely obscured because contemporary benchmarks predominantly prioritize coarse-grained classification metrics, lacking the granular ground truth required to evaluate the underlying reasoning process. To bridge this gap, we first construct a benchmark consisting of two datasets: (1) real-world vulnerabilities with expert-curated causal reasoning as ground truth, and (2) semantically equivalent code perturbations for assessing reasoning robustness. Our large-scale empirical study reveals that even state-of-the-art models struggle to maintain logical consistency during semantic…

Taxonomy

Topics: Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning · Software Engineering Research