Fine-tuning RoBERTa for CVE-to-CWE Classification: A 125M Parameter Model Competitive with LLMs
Nikita Mosievskiy

TL;DR
This paper fine-tunes a RoBERTa-base model to classify CVE descriptions into CWE categories, achieving high accuracy with significantly fewer parameters than large language models, and provides a large dataset and code for the community.
Contribution
The paper introduces a large-scale dataset and a fine-tuned RoBERTa model for CVE-to-CWE classification, demonstrating competitive performance with much smaller models.
Findings
Achieves 87.4% top-1 accuracy on test set.
Macro F1 score of 60.7%, outperforming TF-IDF baseline.
Performs comparably to larger models on external benchmark.
Abstract
We present a fine-tuned RoBERTa-base classifier (125M parameters) for mapping Common Vulnerabilities and Exposures (CVE) descriptions to Common Weakness Enumeration (CWE) categories. We construct a large-scale training dataset of 234,770 CVE descriptions with AI-refined CWE labels using Claude Sonnet 4.6, and agreement-filtered evaluation sets where NVD and AI labels agree. On our held-out test set (27,780 samples, 205 CWE classes), the model achieves 87.4% top-1 accuracy and 60.7% Macro F1 -- a +15.5 percentage-point Macro F1 gain over a TF-IDF baseline that already reaches 84.9% top-1, demonstrating the model's advantage on rare weakness categories. On the external CTI-Bench benchmark (NeurIPS 2024), the model achieves 75.6% strict accuracy (95% CI: 72.8-78.2%) -- statistically indistinguishable from Cisco Foundation-Sec-8B-Reasoning (75.3%, 8B parameters) at 64x fewer parameters. We…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Information and Cyber Security · Advanced Malware Detection Techniques
