Evaluating Large Language Models for Security Bug Report Prediction
Farnaz Soltaniani, Shoaib Razzaq, Mohammad Ghafari

TL;DR
This paper evaluates prompting and fine-tuning of large language models for predicting security bug reports, highlighting the trade-offs between the two in recall, precision, and inference speed.
Contribution
It provides a comparative analysis of prompt-based and fine-tuned LLM approaches for security bug report prediction, revealing their respective strengths and limitations.
Findings
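Prompted proprietary models are the most sensitive to security bug reports, averaging a G-measure of 77% and a recall of 74% across datasets, but with an average precision of only 22%.
Fine-tuned models achieve higher precision (75%) at the cost of lower recall (36%) and a lower G-measure (51%), and offer faster inference (up to 50 times faster on the largest dataset) after a one-time training investment.
Neither approach dominates: the choice is a trade-off between recall, precision, and inference speed.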
Abstract
Early detection of security bug reports (SBRs) is critical for timely vulnerability mitigation. We present an evaluation of prompt-based engineering and fine-tuning approaches for predicting SBRs using Large Language Models (LLMs). Our findings reveal a distinct trade-off between the two approaches. Prompted proprietary models demonstrate the highest sensitivity to SBRs, achieving a G-measure of 77% and a recall of 74% on average across all the datasets, albeit at the cost of a higher false-positive rate, resulting in an average precision of only 22%. Fine-tuned models, by contrast, exhibit the opposite behavior, attaining a lower overall G-measure of 51% but substantially higher precision of 75% at the cost of reduced recall of 36%. Though a one-time investment in building fine-tuned models is necessary, the inference on the largest dataset is up to 50 times faster than that of…
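The abstract reports recall, precision, and G-measure without defining them. The following is a minimal sketch of how these metrics could be computed from a confusion matrix. It assumes the G-measure is the harmonic mean of recall and (1 - false-positive rate), as is common in earlier security bug report prediction work; the abstract does not confirm this definition. The `Confusion` class, the function names, and the counts in the example are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts, with security bug reports as the positive class."""
    tp: int  # SBRs correctly flagged
    fp: int  # non-security reports wrongly flagged
    tn: int  # non-security reports correctly passed over
    fn: int  # SBRs missed


def recall(c: Confusion) -> float:
    """Share of actual SBRs that were flagged (also called sensitivity)."""
    positives = c.tp + c.fn
    return c.tp / positives if positives else 0.0


def precision(c: Confusion) -> float:
    """Share of flagged reports that are actual SBRs."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def false_positive_rate(c: Confusion) -> float:
    """Share of non-security reports that were wrongly flagged."""
    negatives = c.fp + c.tn
    return c.fp / negatives if negatives else 0.0


def g_measure(c: Confusion) -> float:
    """Harmonic mean of recall and (1 - false-positive rate).

    Assumed definition; the source abstract does not spell it out.
    """
    pd = recall(c)
    spec = 1.0 - false_positive_rate(c)
    return 2 * pd * spec / (pd + spec) if (pd + spec) else 0.0


# Hypothetical, heavily imbalanced dataset: 100 SBRs among 10,100 reports.
example = Confusion(tp=74, fp=262, tn=9738, fn=26)
print(f"recall={recall(example):.2f} "
      f"precision={precision(example):.2f} "
      f"g_measure={g_measure(example):.2f}")
```

On imbalanced data such as this, a classifier can combine high recall and a high G-measure with low precision, because a small false-positive rate over many non-security reports still produces more false alarms than true positives. Note that metrics averaged across datasets, as in the abstract, need not satisfy the formula exactly.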
Taxonomy
Topics: Advanced Malware Detection Techniques · Information and Cyber Security · Software Engineering Research
