Evaluating Large Language Models in Vulnerability Detection Under Variable Context Windows
Jie Lin, David Mohaisen

TL;DR
This paper evaluates how the length of tokenized Java code affects the accuracy of ten major large language models in vulnerability detection, highlighting robustness in some models and suggesting preprocessing techniques for improvement.
Contribution
It provides a comparative analysis of LLM performance based on input length and offers recommendations for future model development and preprocessing strategies.
Findings
GPT-4, Mistral, and Mixtral are robust to input length variations.
Other models show a statistically significant association between tokenized code length and detection performance.
Preprocessing that reduces token count while preserving code structure could improve detection accuracy and explicitness.
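The paper does not specify a preprocessing method here, so the following is only one possible sketch of token-reducing preprocessing that keeps code structure intact: it strips Java comments and blank lines while leaving string literals and indentation untouched. The function names (`strip_java_comments`, `compact_java`, `rough_token_count`) are hypothetical, the token counter is a crude regex proxy rather than any model's tokenizer, and Java text blocks (`"""`) are not handled.

```python
import re

# Match string/char literals first so comment markers inside them are left alone.
_JAVA_LITERAL_OR_COMMENT = re.compile(
    r'"(?:\\.|[^"\\])*"'     # double-quoted string literal
    r"|'(?:\\.|[^'\\])*'"    # character literal
    r"|//[^\n]*"             # line comment
    r"|/\*.*?\*/",           # block comment (may span lines)
    re.DOTALL,
)


def strip_java_comments(source: str) -> str:
    """Remove // and /* */ comments, keeping literals unchanged."""
    def replace(match: re.Match) -> str:
        text = match.group(0)
        if text.startswith("/*"):
            return " "   # keep a separator so adjacent tokens do not merge
        if text.startswith("//"):
            return ""
        return text      # a string or char literal: leave as-is
    return _JAVA_LITERAL_OR_COMMENT.sub(replace, source)


def compact_java(source: str) -> str:
    """Strip comments, trailing whitespace, and blank lines; keep indentation."""
    lines = (line.rstrip() for line in strip_java_comments(source).splitlines())
    return "\n".join(line for line in lines if line)


def rough_token_count(text: str) -> int:
    """Crude proxy for token count (assumption: not a real LLM tokenizer)."""
    return len(re.findall(r"\w+|[^\w\s]", text))


if __name__ == "__main__":
    sample = (
        "class A {\n"
        "    // say hi\n"
        '    String s = "http://x"; /* note */\n'
        "\n"
        "    int y = 1;\n"
        "}\n"
    )
    compact = compact_java(sample)
    print(compact)
    print(rough_token_count(sample), "->", rough_token_count(compact))
```

Whether this kind of reduction actually changes detection accuracy for a given model would need to be measured; the sketch only illustrates the shape of such a step.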
Abstract
This study examines the impact of tokenized Java code length on the accuracy and explicitness of ten major LLMs in vulnerability detection. Using chi-square tests and known ground truth, we found inconsistencies across models: some, like GPT-4, Mistral, and Mixtral, showed robustness, while others exhibited a significant link between tokenized length and performance. We recommend future LLM development focus on minimizing the influence of input length for better vulnerability detection. Additionally, preprocessing techniques that reduce token count while preserving code structure could enhance LLM accuracy and explicitness in these tasks.
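To illustrate the kind of chi-square test the abstract refers to, here is a minimal sketch of a test of independence between tokenized-length bins and detection correctness. The bin labels, the counts, and the function names are hypothetical and are not taken from the paper; the p-value helper uses a closed form that is valid only for an even number of degrees of freedom, and every row and column total is assumed to be nonzero.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column variables.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


def chi2_sf_even_dof(stat, dof):
    """Upper-tail p-value of the chi-square distribution, for even dof only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("closed form requires a positive, even dof")
    half = stat / 2.0
    term, series = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i
        series += term
    return math.exp(-half) * series


if __name__ == "__main__":
    # Hypothetical counts: rows are length bins (short, medium, long),
    # columns are (correct, incorrect) detections for one model.
    counts = [[40, 10], [30, 20], [20, 30]]
    stat, dof = chi_square_independence(counts)
    p_value = chi2_sf_even_dof(stat, dof)
    print(f"chi2={stat:.3f}, dof={dof}, p={p_value:.5f}")
```

Under this framing, a small p-value for a model would indicate that its correctness is associated with input length, while a large one would be consistent with the robustness reported for GPT-4, Mistral, and Mixtral.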
Taxonomy
Topics: Web Application Security Vulnerabilities · Network Security and Intrusion Detection · Topic Modeling
Methods: Attention Is All You Need · Label Smoothing · Layer Normalization · Linear Layer · Byte Pair Encoding · Dense Connections · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam
