PATCHEVAL: A New Benchmark for Evaluating LLMs on Patching Real-World Vulnerabilities

Zichao Wei; Jun Zeng; Ming Wen; Zeliang Yu; Kai Cheng; Yiding Zhu; Jingyi Guo; Shiqi Zhou; Le Yin; Xiaodong Su; Zhechao Ma

arXiv:2511.11019·cs.CR·November 17, 2025

TL;DR

This paper introduces PATCHEVAL, a multilingual benchmark of 1,000 real-world vulnerabilities in Go, JavaScript, and Python, designed to evaluate how well large language models can automatically patch software security flaws.

Contribution

The paper presents PATCHEVAL, a reproducible benchmark for evaluating LLMs on real-world vulnerabilities in Go, JavaScript, and Python. It targets four limitations of previous benchmarks: outdated vulnerabilities, limited language coverage, unreliable patch validation, and insufficient reproducibility.

Findings

01. LLMs show promising capabilities in vulnerability patching.

02. The benchmark reveals gaps in current LLM performance.

03. Runtime verification improves the reliability of patch validation.

Abstract

Software vulnerabilities are increasing at an alarming rate. Manual patching is both time-consuming and resource-intensive, while existing automated vulnerability repair (AVR) techniques remain limited in effectiveness. Recent advances in large language models (LLMs) have opened a new paradigm for AVR, demonstrating remarkable progress. To examine the capability of LLMs in AVR, several vulnerability benchmarks have been proposed recently. However, they still suffer from key limitations: outdated vulnerabilities, limited language coverage, unreliable patch validation, and insufficient reproducibility. To overcome these challenges, we introduce PATCHEVAL, a multilingual benchmark for Go, JavaScript, and Python, languages that existing benchmarks leave unexplored. PATCHEVAL curates a dataset of 1,000 vulnerabilities drawn from CVEs reported between 2015 and 2025, covering…

Code & Models

Datasets

ByteDance/PatchEval
dataset · 134 dl

Taxonomy

Topics: Web Application Security Vulnerabilities · Security and Verification in Computing · Software Testing and Debugging Techniques