From Model to Breach: Towards Actionable LLM-Generated Vulnerabilities Reporting

Cyril Vallez; Alexander Sternfeld; Andrei Kucharavy; Ljiljana Dolamic

arXiv:2511.04538·cs.CL·November 7, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates the vulnerability of large language models used in coding, revealing persistent security issues and proposing new metrics to assess and mitigate the risks of generated vulnerabilities.

Contribution

It introduces the Prompt Exposure and Model Exposure metrics to quantify and address the security risks of LLM-generated code vulnerabilities.

Findings

01 Open-weight models remain vulnerable to early reported scenarios.

02 Existing benchmarks have limited impact on improving model security.

03 New metrics help identify and prioritize vulnerabilities for mitigation.

Abstract

As the role of Large Language Model (LLM)-based coding assistants in software development becomes more critical, so does the role of the bugs they generate in the overall cybersecurity landscape. While a number of LLM code security benchmarks have been proposed alongside approaches to improve the security of generated code, it remains unclear to what extent they have impacted widely used coding LLMs. Here, we show that even the latest open-weight models are vulnerable in the earliest reported vulnerability scenarios in a realistic use setting, suggesting that the safety-functionality trade-off has until now prevented effective patching of vulnerabilities. To help address this issue, we introduce a new severity metric that reflects the risk posed by an LLM-generated vulnerability, accounting for vulnerability severity, generation chance, and the formulation of the prompt that induces…
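
The abstract names three inputs to the severity metric: vulnerability severity, generation chance, and prompt formulation. The sketch below shows one possible way to combine them and is not the paper's definition, which this page does not reproduce. The function names `prompt_exposure` and `model_exposure`, the 0-10 severity scale, and the use of plain averages are all assumptions made for illustration.

```python
from statistics import mean


def prompt_exposure(severity: float, vulnerable_rates: list[float]) -> float:
    """Hypothetical per-scenario score (NOT the paper's formula).

    severity: assumed 0-10 severity score for the scenario's vulnerability.
    vulnerable_rates: one value per paraphrase of the prompt, each the
        fraction of sampled completions that contained the vulnerability.
    """
    if not 0.0 <= severity <= 10.0:
        raise ValueError("severity must lie in [0, 10]")
    if not vulnerable_rates:
        raise ValueError("need at least one prompt paraphrase")
    if any(not 0.0 <= r <= 1.0 for r in vulnerable_rates):
        raise ValueError("rates must lie in [0, 1]")
    # Average over paraphrases so one lucky or unlucky wording does not
    # dominate, then weight the generation chance by severity.
    return severity * mean(vulnerable_rates)


def model_exposure(prompt_scores: list[float]) -> float:
    """Hypothetical model-level score: mean of the per-scenario scores."""
    if not prompt_scores:
        raise ValueError("need at least one scenario score")
    return mean(prompt_scores)


# Illustrative numbers only; they are not results from the paper.
scenario_a = prompt_exposure(8.0, [0.5, 0.25, 0.75])
scenario_b = prompt_exposure(4.0, [0.5, 0.5])
overall = model_exposure([scenario_a, scenario_b])
```

Under these assumptions, a scenario scores high only when the vulnerability is both severe and frequently generated across rewordings of the prompt. Averaging per-scenario scores gives a single number for comparing models.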

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

The novel risk-based metrics PE and ME, allowing models to be ranked by their security exposure. Figure 2 is one of the strongest components of the paper. It clearly visualises the variance in vulnerability generation across CWE categories, models, and paraphrased prompts, demonstrating that some models consistently produce insecure code across minor prompt perturbations. However, this analysis is limited by the extremely small dataset of 17 prompts, all inherited from Asleep at the Keyboard.

Weaknesses

* The work does not adequately position itself within the growing body of work on evaluating LLM-generated code security. Several benchmarks now exist that specifically assess vulnerability generation and, in many cases, also evaluate functional correctness, thus addressing exactly the functionality–security trade-off the paper discusses. These include Sec-Code-Bench, SecRepoBench, SafeGenBench, and benchmarks that jointly test correctness and security such as CWEval, SecCodePLT, BaxBench, and C

Reviewer 02 · Rating 0 · Confidence 5

Strengths

- Understanding the security of LLM-generated code is an important topic

Weaknesses

- The novelty (of dataset and scoring method) is very limited
- The resulting dataset is smaller than the original, raising the question about the contribution.
- The paper writing is verbose and chaotic: it is very hard to pinpoint the precise contributions and arguments. Some text, like section 3.1, contains no relevant information. Also section 4.1 for some reason discusses LLM performance on HumanEval, which seems unrelated.
- The findings are inconclusive and unsurprising. The conclusion

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The idea of severity scores for LLM generated code is novel and timely. Moreover, the consideration of differences in prompts and vulnerability types makes the metrics more fine-grained and compelling to use.
- The evaluation of newer LLMs on the not-so-new AATK benchmark is also very useful in pointing out that secure code completion should still be an active area of research. Further, the proposed ME scores induce an ordering over LLMs taking the severity of completions into account, which

Weaknesses

- The title mentions "actionable" reporting. While the ME scores can be useful in choosing between LLMs, I find it difficult to understand how these scores guide actions / future research for specific models. The prompt-level granularity offered by PE scores might be too specific to take any meaningful actions for updating these models. A CWE-level metric, not explored in this work, would be more "actionable" in the sense that it could identify specific vulnerabilities that any model is prone to

Taxonomy

Topics: Software Engineering Research · Information and Cyber Security · Web Application Security Vulnerabilities