VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation

Ethan TS. Liu; Austin Wang; Spencer Mateega; Carlos Georgescu; Danny Tang

arXiv:2505.19395·cs.CR·May 27, 2025

TL;DR

VADER is a human-evaluated benchmark comprising 174 real-world software vulnerabilities designed to assess large language models' capabilities in vulnerability assessment, detection, explanation, and remediation, highlighting current limitations and guiding future improvements.

Contribution

This work introduces VADER, a comprehensive, human-evaluated benchmark for vulnerability handling by LLMs. It provides a detailed dataset, evaluation rubrics, and analysis tools that were not previously available.

Findings

01

State-of-the-art LLMs achieve around 50-55% accuracy on VADER.

02

Remediation quality correlates strongly with accurate classification and test plan formulation.

03

Current models show significant room for improvement in vulnerability assessment tasks.

Abstract

Ensuring that large language models (LLMs) can effectively assess, detect, explain, and remediate software vulnerabilities is critical for building robust and secure software systems. We introduce VADER, a human-evaluated benchmark designed explicitly to assess LLM performance across four key vulnerability-handling dimensions: assessment, detection, explanation, and remediation. VADER comprises 174 real-world software vulnerabilities, each carefully curated from GitHub repositories and annotated by security experts. For each vulnerability case, models are tasked with identifying the flaw, classifying it using Common Weakness Enumeration (CWE), explaining its underlying cause, proposing a patch, and formulating a test plan. Using a one-shot prompting strategy, we benchmark six state-of-the-art LLMs (Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1, GPT-4.5, Grok 3 Beta, and o3) on VADER, and…
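The five outputs requested for each vulnerability case (flaw identification, CWE classification, explanation, patch, and test plan) can be pictured as a simple record. The sketch below is illustrative only: the class name `ModelResponse`, its field names, and the `missing_parts` helper are hypothetical and are not taken from the VADER repository or its rubric. It checks only that a response is complete and that the CWE label is well formed; it does not judge correctness, which the benchmark leaves to human evaluators.

```python
from dataclasses import dataclass, fields
import re


@dataclass
class ModelResponse:
    """One model's answer to a single vulnerability case (hypothetical schema)."""
    flaw_location: str  # where the model says the flaw is
    cwe_id: str         # CWE label, e.g. "CWE-79"
    explanation: str    # underlying cause of the vulnerability
    patch: str          # proposed fix
    test_plan: str      # how to verify the fix


# A CWE identifier is written as "CWE-" followed by digits.
CWE_PATTERN = re.compile(r"^CWE-\d+$")


def missing_parts(resp: ModelResponse) -> list[str]:
    """Return the names of required outputs that are empty or malformed."""
    problems = [f.name for f in fields(resp) if not getattr(resp, f.name).strip()]
    cwe = resp.cwe_id.strip()
    if cwe and not CWE_PATTERN.match(cwe):
        problems.append("cwe_id")
    return problems
```

A structural check like this could run before responses are handed to reviewers, so that human effort is spent only on answers that address every part of the task.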

Code & Models

Repositories

afterquery/vader (Official)

Taxonomy

Topics: Anomaly Detection Techniques and Applications

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Layer Normalization · Byte Pair Encoding