The Range Shrinks, the Threat Remains: Re-evaluating LLM Package Hallucinations on the 2026 Frontier-Model Cohort
Aleksandr Churilov (Independent Researcher)

TL;DR
This study measures package hallucination rates for five frontier code-capable large language models. Rates are lower than in earlier models, but the models still generate nonexistent package names that attackers could register as malicious packages (slopsquatting).
Contribution
Replicates the methodology of Spracklen et al. (USENIX Security '25) on five newer models, quantifies their hallucination rates, and identifies a set of hallucinated package names shared across models, which indicates a supply-chain attack surface.
Findings
Overall hallucination rates range from 4.62% (Claude Haiku 4.5) to 6.10% (GPT-5.4-mini).
127 hallucinated package names are common to all models.
Hallucination shows a Python-over-JavaScript asymmetry.
Abstract
Spracklen et al. (USENIX Security '25) showed that code-generating large language models hallucinate package names that do not exist on PyPI or npm at rates ranging from 5.2% on commercial models to 21.7% on open-source models, creating an attack surface for slopsquatting -- the registration of malicious packages under hallucinated names. We replicate their methodology on five frontier code-capable LLMs released between October 2025 and March 2026: Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5.4-mini, Gemini 2.5 Pro, and DeepSeek V3.2. Across 199,845 paired Python and JavaScript prompts validated against PyPI and npm master lists, we measure overall hallucination rates between 4.62% (Claude Haiku 4.5) and 6.10% (GPT-5.4-mini) -- an order-of-magnitude compression of the inter-model spread observed by Spracklen, but not a retirement of the threat. Beyond replication, we identify a set of 127…
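To make the measurement concrete, the following is a minimal sketch of how a hallucination rate and a cross-model shared set could be computed once package names have been extracted from model output. It is not the paper's pipeline. The function names, the toy master list, and the placeholder package names are all hypothetical; the rate is assumed to be the fraction of generated package references that are absent from the master list; and the name normalization follows the PyPI convention, so npm names would need their own rule. The last lines check the abstract's spread comparison using only the figures quoted there.

```python
import re


def normalize_pypi(name: str) -> str:
    # PyPI-style normalization: lowercase, runs of "-", "_", "." become "-".
    # Assumption: this is applied on both sides of the comparison.
    return re.sub(r"[-_.]+", "-", name).lower()


def hallucination_rate(generated, master):
    """Return (rate, missing names) for one model.

    Assumption: rate = references not in the master list / all references.
    """
    known = {normalize_pypi(n) for n in master}
    names = [normalize_pypi(n) for n in generated]
    if not names:
        return 0.0, set()
    missing = [n for n in names if n not in known]
    return len(missing) / len(names), set(missing)


def shared_hallucinations(per_model):
    """Names hallucinated by every model (set intersection)."""
    sets = list(per_model.values())
    return set.intersection(*sets) if sets else set()


# Toy data: a three-entry stand-in for a registry master list and
# invented placeholder names, used only to exercise the functions.
master = ["requests", "numpy", "Flask"]
model_a = ["requests", "numpy", "example-hallucinated-pkg", "flask"]
model_b = ["numpy", "example_hallucinated_pkg", "another-made-up-pkg"]

rate_a, missing_a = hallucination_rate(model_a, master)
rate_b, missing_b = hallucination_rate(model_b, master)
shared = shared_hallucinations({"model_a": missing_a, "model_b": missing_b})

# Spread comparison from the abstract, in percentage points.
prior_spread = 21.7 - 5.2    # Spracklen et al.: commercial vs. open-source
cohort_spread = 6.10 - 4.62  # this cohort: GPT-5.4-mini vs. Claude Haiku 4.5
spread_ratio = prior_spread / cohort_spread

print(f"model_a rate: {rate_a:.2%}, model_b rate: {rate_b:.2%}")
print(f"shared hallucinated names: {sorted(shared)}")
print(f"spread ratio: {spread_ratio:.1f}x")
```

On the quoted figures the prior spread is about 16.5 percentage points and the cohort spread about 1.48, a ratio of roughly 11, which is consistent with the abstract's "order-of-magnitude" wording.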
