Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection

Chinmay Pushkar; Sanchit Kabra; Dhruv Kumar; Jagat Sesh Challa

arXiv:2512.22306 · cs.CR · December 30, 2025

TL;DR

This paper introduces a comprehensive benchmark for multi-vulnerability detection in code, revealing significant performance degradation of large language models as vulnerability complexity increases across multiple programming languages.

Contribution

The paper presents a new large-scale dataset and benchmark for multi-vulnerability detection in code, and systematically analyzes LLM performance on multi-vulnerability samples across four languages: C, C++, Python, and JavaScript.

Findings

01

LLMs' performance drops by up to 40% with increased vulnerability density.

02

Llama-3.3-70B achieves near-perfect scores on single-vulnerability C tasks.

03

Python and JavaScript exhibit severe under-counting in complex code samples.

Abstract

Large Language Models (LLMs) have demonstrated significant potential in automated software security, particularly in vulnerability detection. However, existing benchmarks primarily focus on isolated, single-vulnerability samples or function-level classification, failing to reflect the complexity of real-world software where multiple interacting vulnerabilities often coexist within large files. Recent studies indicate that LLMs suffer from "count bias" and "selection bias" in multi-label tasks, yet this has not been rigorously quantified in the domain of code security. In this work, we introduce a comprehensive benchmark for Multi-Vulnerability Detection across four major languages: C, C++, Python, and JavaScript. We construct a dataset of 40,000 files by systematically injecting controlled counts of vulnerabilities (1, 3, 5, and 9) into long-context code samples (7.5k-10k tokens)…
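One way to make the "count bias" mentioned in the abstract concrete is to compare the number of vulnerabilities a model reports with the number injected, grouped by injected count. The sketch below is illustrative only: the function name `count_bias`, the signed-error metric, and the toy records are assumptions made for this example, not the paper's evaluation code or results.

```python
from collections import defaultdict
from statistics import mean


def count_bias(records):
    """Mean signed error (predicted - injected) per injected count.

    `records` is an iterable of (injected_count, predicted_count) pairs.
    Negative values indicate under-counting. This metric is an assumption
    chosen for illustration, not necessarily the one used in the paper.
    """
    errors = defaultdict(list)
    for injected, predicted in records:
        errors[injected].append(predicted - injected)
    # Sort by injected count so the output follows the 1, 3, 5, 9 ordering.
    return {count: mean(errs) for count, errs in sorted(errors.items())}


# Made-up toy predictions, for illustration only (not data from the paper).
toy_records = [
    (1, 1), (1, 1),
    (3, 3), (3, 2),
    (5, 4), (5, 3),
    (9, 6), (9, 5),
]

bias = count_bias(toy_records)
for injected, err in bias.items():
    print(f"injected={injected}: mean signed error = {err}")
```

In this toy data, the mean signed error becomes more negative as the injected count rises, which is the pattern a reader would look for when checking a model for under-counting.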

Taxonomy

Topics: Software Engineering Research · Web Application Security Vulnerabilities · Information and Cyber Security