VISION: Robust and Interpretable Code Vulnerability Detection Leveraging Counterfactual Augmentation

David Egea; Barproda Halder; Sanghamitra Dutta

arXiv:2508.18933 · cs.AI · September 10, 2025

TL;DR

VISION introduces a counterfactual augmentation framework using LLMs and GNNs to improve robustness and interpretability in code vulnerability detection, significantly reducing spurious correlations and enhancing generalization.

Contribution

The paper presents a novel framework combining counterfactual data augmentation, targeted GNN training, and interpretability techniques to improve vulnerability detection accuracy and trustworthiness.

Findings

01 Accuracy improved from 51.8% to 97.8%.

02 Pairwise contrast accuracy increased from 4.5% to 95.8%.

03 Worst-group accuracy rose from 0.7% to 85.5%.
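
The three metrics above are not defined on this page, so the following is only a minimal sketch under assumed definitions: pairwise contrast accuracy is taken to be the fraction of (original, counterfactual) pairs in which both samples are classified correctly, and worst-group accuracy is taken to be the lowest per-group accuracy. The function names, the pair layout, and the group labels are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def pairwise_contrast_accuracy(pairs):
    """Assumed definition: a pair counts only if BOTH members are correct.

    Each pair is ((true_orig, pred_orig), (true_cf, pred_cf)), where the
    counterfactual carries the opposite label of the original sample.
    """
    both_correct = sum(
        1
        for (t_orig, p_orig), (t_cf, p_cf) in pairs
        if t_orig == p_orig and t_cf == p_cf
    )
    return both_correct / len(pairs)


def worst_group_accuracy(y_true, y_pred, groups):
    """Assumed definition: minimum accuracy over the given groups."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        totals[g] += 1
        hits[g] += int(t == p)
    return min(hits[g] / totals[g] for g in totals)


if __name__ == "__main__":
    # Toy labels: 1 = vulnerable, 0 = benign (illustrative values only).
    y_true = [1, 0, 1, 0]
    y_pred = [1, 0, 0, 0]
    groups = ["a", "a", "b", "b"]  # hypothetical group assignment
    pairs = [((1, 1), (0, 0)), ((1, 0), (0, 0))]

    print(accuracy(y_true, y_pred))
    print(pairwise_contrast_accuracy(pairs))
    print(worst_group_accuracy(y_true, y_pred, groups))
```

Under these assumed definitions, a model that predicts the same label for a sample and its counterfactual can still reach about 50% plain accuracy while scoring near zero on the pairwise metric, which is consistent with the gap between the reported baseline figures.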

Abstract

Automated detection of vulnerabilities in source code is an essential cybersecurity challenge, underpinning trust in digital systems and services. Graph Neural Networks (GNNs) have emerged as a promising approach as they can learn structural and logical code relationships in a data-driven manner. However, their performance is severely constrained by training data imbalances and label noise. GNNs often learn 'spurious' correlations from superficial code similarities, producing detectors that fail to generalize well to unseen real-world data. In this work, we propose a unified framework for robust and interpretable vulnerability detection, called VISION, to mitigate spurious correlations by systematically augmenting a counterfactual training dataset. Counterfactuals are samples with minimal semantic modifications but opposite labels. Our framework includes: (i) generating counterfactuals…

Code & Models

Datasets

David-Egea/CWE-20-CFA
dataset · 58 dl