PATCHEVAL: A New Benchmark for Evaluating LLMs on Patching Real-World Vulnerabilities
Zichao Wei, Jun Zeng, Ming Wen, Zeliang Yu, Kai Cheng, Yiding Zhu, Jingyi Guo, Shiqi Zhou, Le Yin, Xiaodong Su, Zhechao Ma

TL;DR
This paper introduces PATCHEVAL, a multilingual benchmark of 1,000 real-world vulnerabilities in Go, JavaScript, and Python, designed to evaluate and improve large language models' ability to automatically patch software security flaws.
Contribution
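The paper presents PATCHEVAL, a novel, diverse, and reproducible benchmark for evaluating LLMs on real-world vulnerabilities in multiple programming languages. It addresses limitations of previous benchmarks: outdated vulnerabilities, limited language coverage, unreliable patch validation, and insufficient reproducibility.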
Findings
LLMs show promising capabilities in vulnerability patching.
Benchmark reveals gaps in current LLM performance.
Runtime verification improves patch validation reliability.
Abstract
Software vulnerabilities are increasing at an alarming rate. However, manual patching is both time-consuming and resource-intensive, while existing automated vulnerability repair (AVR) techniques remain limited in effectiveness. Recent advances in large language models (LLMs) have opened a new paradigm for AVR, demonstrating remarkable progress. To examine the capability of LLMs in AVR, several vulnerability benchmarks have been proposed recently. However, they still suffer from key limitations: outdated vulnerabilities, limited language coverage, unreliable patch validation, and insufficient reproducibility. To overcome these challenges, we introduce PATCHEVAL, a multilingual benchmark for Go, JavaScript, and Python, languages that existing benchmarks leave unexplored. PATCHEVAL curates a dataset of 1,000 vulnerabilities drawn from CVEs reported between 2015 and 2025, covering…
Taxonomy
Topics: Web Application Security Vulnerabilities · Security and Verification in Computing · Software Testing and Debugging Techniques
