Large Language Models Cannot Reliably Detect Vulnerabilities in JavaScript: The First Systematic Benchmark and Evaluation

Qingyuan Fei; Xin Liu; Song Li; Shujiang Wu; Jianwei Hou; Ping Chen; Zifeng Kang

arXiv:2512.01255·cs.CR·December 2, 2025

TL;DR

This paper systematically evaluates the ability of Large Language Models to detect JavaScript vulnerabilities, revealing significant limitations and proposing a comprehensive benchmark framework to measure their true capabilities.

Contribution

The paper introduces the first set of principles for constructing a JavaScript vulnerability detection benchmark, together with the FORGEJS framework, the ARENAJS benchmark, and JUDGEJS, an automatic evaluation framework, to assess how effectively LLMs detect JavaScript vulnerabilities.

Findings

01. LLMs show limited reasoning in vulnerability detection.
02. LLMs suffer from severe robustness issues.
03. Reliable JavaScript vulnerability detection with LLMs remains challenging.

Abstract

Researchers have proposed numerous methods to detect vulnerabilities in JavaScript, especially those assisted by Large Language Models (LLMs). However, the actual capability of LLMs in JavaScript vulnerability detection remains questionable, necessitating systematic evaluation and comprehensive benchmarks. Unfortunately, existing benchmarks suffer from three critical limitations: (1) incomplete coverage, such as covering a limited subset of CWE types; (2) underestimation of LLM capabilities caused by unreasonable ground truth labeling; and (3) overestimation due to unrealistic cases such as using isolated vulnerable files rather than complete projects. In this paper, we introduce, for the first time, three principles for constructing a benchmark for JavaScript vulnerability detection that directly address these limitations: (1) comprehensiveness, (2) no underestimation, and (3) no…

Taxonomy

Topics: Web Application Security Vulnerabilities · Security and Verification in Computing · Advanced Malware Detection Techniques