From Model to Breach: Towards Actionable LLM-Generated Vulnerabilities Reporting
Cyril Vallez, Alexander Sternfeld, Andrei Kucharavy, Ljiljana Dolamic

TL;DR
This paper investigates the security of code generated by large language model coding assistants, showing that previously reported vulnerabilities persist in recent models and proposing new metrics to assess and prioritize the risk of generated vulnerabilities.
Contribution
It introduces the Prompt Exposure and Model Exposure metrics to quantify and address the security risks of LLM-generated code vulnerabilities.
Findings
Open-weight models remain vulnerable to early reported scenarios.
Existing benchmarks have limited impact on improving model security.
New metrics help identify and prioritize vulnerabilities for mitigation.
Abstract
As the role of Large Language Models (LLM)-based coding assistants in software development becomes more critical, so does the role of the bugs they generate in the overall cybersecurity landscape. While a number of LLM code security benchmarks have been proposed alongside approaches to improve the security of generated code, it remains unclear to what extent they have impacted widely used coding LLMs. Here, we show that even the latest open-weight models are vulnerable in the earliest reported vulnerability scenarios in a realistic use setting, suggesting that the safety-functionality trade-off has until now prevented effective patching of vulnerabilities. To help address this issue, we introduce a new severity metric that reflects the risk posed by an LLM-generated vulnerability, accounting for vulnerability severity, generation chance, and the formulation of the prompt that induces…
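The abstract names three ingredients of the severity metric: vulnerability severity, generation chance, and prompt formulation. The sketch below only illustrates how such ingredients could be combined; it is not the paper's definition of Prompt Exposure or Model Exposure, which is not reproduced in this summary. The function names, the 0-10 severity scale, and the choice of simple averaging are all assumptions made for illustration.

```python
from statistics import mean


def illustrative_prompt_risk(severity, vulnerable_rates):
    """Toy per-scenario risk score (NOT the paper's PE formula).

    severity: assumed 0-10 severity score for the vulnerability (CVSS-like).
    vulnerable_rates: for each paraphrase of the prompt, the fraction of
        sampled completions that contained the vulnerability.
    """
    if not 0.0 <= severity <= 10.0:
        raise ValueError("severity must be in [0, 10]")
    if not vulnerable_rates:
        raise ValueError("need at least one prompt paraphrase")
    if any(not 0.0 <= r <= 1.0 for r in vulnerable_rates):
        raise ValueError("rates must be fractions in [0, 1]")
    # Average over paraphrases so one lucky or unlucky wording
    # does not dominate, then weight by severity.
    return severity * mean(vulnerable_rates)


def illustrative_model_risk(scenarios):
    """Toy model-level score (NOT the paper's ME formula).

    scenarios: iterable of (severity, vulnerable_rates) pairs.
    """
    return mean(illustrative_prompt_risk(s, r) for s, r in scenarios)


# Hypothetical numbers, for illustration only.
scenarios = [
    (8.0, [0.5, 0.25, 0.0]),  # severe flaw, generated for some paraphrases
    (4.0, [1.0, 1.0]),        # milder flaw, generated every time
]
print(illustrative_model_risk(scenarios))
```

Under this toy weighting, a moderate vulnerability that is generated every time can outrank a severe one that appears only occasionally, which is the kind of prioritization a risk-based score is meant to expose.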
Peer Reviews
Decision · Submitted to ICLR 2026
The paper introduces novel risk-based metrics, PE and ME, which allow models to be ranked by their security exposure. Figure 2 is one of the strongest components of the paper. It clearly visualises the variance in vulnerability generation across CWE categories, models, and paraphrased prompts, demonstrating that some models consistently produce insecure code across minor prompt perturbations. However, this analysis is limited by the extremely small dataset of 17 prompts, all inherited from Asleep at the Keyboard.
* the work does not adequately position itself within the growing body of work on evaluating LLM-generated code security. Several benchmarks now exist that specifically assess vulnerability generation and, in many cases, also evaluate functional correctness, thus addressing exactly the functionality–security trade-off the paper discusses. These include Sec-Code-Bench, SecRepoBench, SafeGenBench, and benchmarks that jointly test correctness and security such as CWEval, SecCodePLT, BaxBench, and C
- Understanding the security of LLM-generated code is an important topic
- The novelty (of dataset and scoring method) is very limited
- The resulting dataset is smaller than the original, raising the question about the contribution.
- The paper writing is verbose and chaotic -- it is very hard to pinpoint the precise contributions and arguments. Some text, like section 3.1, contains no relevant information. Also section 4.1 for some reason discusses LLM performance on HumanEval, which seems unrelated.
- The findings are inconclusive and unsurprising. The conclusion
- The idea of severity scores for LLM generated code is novel and timely. Moreover, the consideration of differences in prompts and vulnerability types makes the metrics more fine-grained and compelling to use.
- The evaluation of newer LLMs on the not-so-new AATK benchmark is also very useful in pointing out that secure code completion should still be an active area of research. Further, the proposed ME scores induce an ordering over LLMs taking the severity of completions into account, which
- The title mentions "actionable" reporting. While the ME scores can be useful in choosing between LLMs, I find it difficult to understand how these scores guide actions / future research for specific models. The prompt-level granularity offered by PE scores might be too specific to take any meaningful actions for updating these models. A CWE-level metric, not explored in this work, would be more "actionable" in the sense that it could identify specific vulnerabilities that any model is prone to
Taxonomy
Topics: Software Engineering Research · Information and Cyber Security · Web Application Security Vulnerabilities
