VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation
Ethan TS. Liu, Austin Wang, Spencer Mateega, Carlos Georgescu, Danny Tang

TL;DR
VADER is a human-evaluated benchmark comprising 174 real-world software vulnerabilities designed to assess large language models' capabilities in vulnerability assessment, detection, explanation, and remediation, highlighting current limitations and guiding future improvements.
Contribution
This work introduces VADER, a human-evaluated benchmark for how LLMs handle software vulnerabilities. It comes with detailed datasets, evaluation rubrics, and analysis tools, a combination that was not previously available.
Findings
State-of-the-art LLMs achieve around 50-55% accuracy on VADER.
Remediation quality correlates strongly with accurate classification and test plan formulation.
Current models show significant room for improvement in vulnerability assessment tasks.
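To make the correlation finding concrete, here is a minimal sketch of how a Pearson coefficient between two per-case score lists could be computed. The summary does not say which correlation statistic the authors used, so Pearson is an assumption, and the score lists below are made-up placeholders rather than VADER results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Placeholder scores for illustration only (NOT taken from the paper).
remediation_scores = [1.0, 2.0, 3.0, 4.0]
classification_scores = [2.0, 4.0, 6.0, 8.0]

r = pearson(remediation_scores, classification_scores)
print(f"r = {r:.2f}")
```

A value of r close to 1 would indicate that cases with better classification also tend to get better remediation, which is the kind of relationship the finding describes.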
Abstract
Ensuring that large language models (LLMs) can effectively assess, detect, explain, and remediate software vulnerabilities is critical for building robust and secure software systems. We introduce VADER, a human-evaluated benchmark designed explicitly to assess LLM performance across four key vulnerability-handling dimensions: assessment, detection, explanation, and remediation. VADER comprises 174 real-world software vulnerabilities, each carefully curated from GitHub repositories and annotated by security experts. For each vulnerability case, models are tasked with identifying the flaw, classifying it using Common Weakness Enumeration (CWE), explaining its underlying cause, proposing a patch, and formulating a test plan. Using a one-shot prompting strategy, we benchmark six state-of-the-art LLMs (Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1, GPT-4.5, Grok 3 Beta, and o3) on VADER, and…
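The five per-case tasks named in the abstract (identify the flaw, classify it by CWE, explain the cause, propose a patch, formulate a test plan) can be pictured as one structured response per vulnerability. The sketch below is illustrative only: the class name, field names, and the completeness check are hypothetical and are not taken from the VADER dataset or its rubrics.

```python
from dataclasses import dataclass, fields


@dataclass
class VulnerabilityResponse:
    """Hypothetical container for one model answer; field names are assumptions."""
    flaw_location: str   # where the model says the flaw is
    cwe_id: str          # CWE classification, e.g. "CWE-89" (SQL injection)
    explanation: str     # underlying cause of the vulnerability
    patch: str           # proposed fix
    test_plan: str       # how to verify the fix


def missing_parts(response):
    """Return the names of fields that are empty or whitespace-only."""
    return [
        f.name
        for f in fields(response)
        if not getattr(response, f.name).strip()
    ]


# Made-up example answer with the test plan left out.
example = VulnerabilityResponse(
    flaw_location="handler.py: build_query()",
    cwe_id="CWE-89",
    explanation="User input is concatenated directly into a SQL string.",
    patch="Use a parameterized query instead of string concatenation.",
    test_plan="",
)

print(missing_parts(example))  # ['test_plan']
```

Keeping the five parts as separate fields makes it straightforward for a human grader, or a script, to score each dimension on its own.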
Taxonomy
Topics: Anomaly Detection Techniques and Applications
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Layer Normalization · Byte Pair Encoding
