From Description to Score: Can LLMs Quantify Vulnerabilities?
Sima Jafarikhah, Daniel Thompson, Eva Deans, Hossein Siadati, Yi Liu

TL;DR
This paper evaluates the ability of large language models to automate vulnerability scoring, finding they outperform baselines on some metrics but face challenges due to ambiguous descriptions and limited context.
Contribution
It demonstrates the potential and limitations of LLMs like ChatGPT and Llama in automating CVE scoring, highlighting the need for richer vulnerability descriptions.
Findings
LLMs outperform the baseline on certain CVSS metrics (substantially on Availability Impact, more modestly on Attack Complexity)
Model performance varies across LLM families and metrics
Ambiguous CVE descriptions hinder accurate classification
Abstract
Manual vulnerability scoring, such as assigning Common Vulnerability Scoring System (CVSS) scores, is a resource-intensive process that is often influenced by subjective interpretation. This study investigates the potential of general-purpose large language models (LLMs), namely ChatGPT, Llama, Grok, DeepSeek, and Gemini, to automate this process by analyzing over 31,000 recent Common Vulnerabilities and Exposures (CVE) entries. The results show that LLMs substantially outperform the baseline on certain metrics (e.g., Availability Impact), while offering more modest gains on others (e.g., Attack Complexity). Moreover, model performance varies across both LLM families and individual CVSS metrics, with ChatGPT-5 attaining the highest precision. Our analysis reveals that LLMs tend to misclassify many of the same CVEs, and ensemble-based meta-classifiers only marginally…
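To make the comparison in the abstract concrete, here is a minimal Python sketch of scoring several models' predictions for one CVSS metric against a baseline and against a combined prediction. It is an illustration under stated assumptions, not the paper's method. The baseline is assumed to be "always predict the most frequent label", and plain majority voting stands in for the ensemble meta-classifier; the text here specifies neither. The labels, the model names (`model_a`, `model_b`, `model_c`), and all predictions are invented toy data, using the CVSS v3.1 Availability Impact values N (None), L (Low), and H (High).

```python
from collections import Counter

def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def majority_class_baseline(labels):
    """Assumed baseline (not confirmed by the source): always predict
    the most frequent reference label."""
    most_common_label, _ = Counter(labels).most_common(1)[0]
    return [most_common_label] * len(labels)

def majority_vote(per_model_predictions):
    """Combine several models' predictions per CVE by plurality vote.
    This is a stand-in for a meta-classifier, not the paper's ensemble."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*per_model_predictions)]

# Toy Availability Impact labels for six hypothetical CVEs.
labels  = ["H", "H", "N", "L", "H", "N"]

# Hypothetical per-model predictions; all three miss the last CVE
# in the same way.
model_a = ["H", "H", "N", "H", "H", "L"]
model_b = ["H", "N", "N", "L", "H", "L"]
model_c = ["H", "H", "L", "L", "N", "L"]

baseline = majority_class_baseline(labels)
ensemble = majority_vote([model_a, model_b, model_c])

for name, preds in [("baseline", baseline), ("model_a", model_a),
                    ("model_b", model_b), ("model_c", model_c),
                    ("ensemble", ensemble)]:
    print(f"{name}: {accuracy(preds, labels):.2f}")
```

In this toy data the vote corrects errors that only one model makes, but it cannot fix the last CVE, which every model mislabels identically. That is one plausible reading of why shared misclassifications would limit what an ensemble can add.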
Taxonomy
Topics: Information and Cyber Security · Adversarial Robustness in Machine Learning · Web Application Security Vulnerabilities
