Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders
David Noever, Forrest McKee

TL;DR
This paper investigates how large language models can be exploited through context shifts to generate malicious code or recommendations, revealing new attack vectors in code repositories and content delivery networks.
Contribution
It demonstrates that foundational LLMs can be manipulated to produce harmful outputs during context changes, exposing novel security vulnerabilities in AI-assisted coding.
Findings
LLMs can be tricked into suggesting malicious code when context shifts occur.
Attack surface includes repositories like GitHub, NPM, and CDN platforms.
Models may drop safety guardrails under certain prompt conditions.
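On the defensive side, the findings above suggest treating any package name or CDN URL in an AI-generated suggestion as unverified until it has been checked. The following is a minimal sketch of such a check, not the authors' method. The allowlists, the `find_unvetted` helper, and the regular expressions are all hypothetical and serve only as illustration. It scans a suggestion for `pip install` or `npm install` commands and for URL hosts, and reports anything not on a locally maintained allowlist.

```python
import re

# Hypothetical allowlists (assumptions for illustration only); a real
# project would maintain its own vetted lists.
TRUSTED_PACKAGES = {"requests", "numpy", "lodash"}
TRUSTED_CDN_HOSTS = {"cdn.jsdelivr.net", "cdnjs.cloudflare.com"}

# Matches the argument list of a "pip install" or "npm install" command.
INSTALL_RE = re.compile(r"(?:pip|npm)\s+install\s+([^\n]+)")
# Captures the host part of an http(s) URL.
URL_RE = re.compile(r"https?://([A-Za-z0-9.-]+)")


def find_unvetted(suggestion: str) -> dict:
    """Return package names and URL hosts that are not on the allowlists.

    Names are compared verbatim, so a version-pinned name such as
    "requests==2.0" is reported as unvetted (a conservative default).
    """
    packages = set()
    for match in INSTALL_RE.finditer(suggestion):
        for token in match.group(1).split():
            if not token.startswith("-"):  # skip command-line flags
                packages.add(token)
    hosts = {host.lower() for host in URL_RE.findall(suggestion)}
    return {
        "packages": sorted(packages - TRUSTED_PACKAGES),
        "hosts": sorted(hosts - TRUSTED_CDN_HOSTS),
    }


if __name__ == "__main__":
    # Made-up example input; the package and host are placeholders.
    suggestion = (
        "pip install requests totally-real-pkg\n"
        '<script src="https://cdn.example.invalid/lib.js"></script>\n'
    )
    print(find_unvetted(suggestion))
```

A check like this only flags names for human review. It does not establish that an allowlisted package or host is safe, and it does not detect malicious code pasted directly into a suggestion.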
Abstract
The research builds and evaluates the adversarial potential to introduce copied code or hallucinated AI recommendations for malicious code in popular code repositories. While foundational large language models (LLMs) from OpenAI, Google, and Anthropic guard against both harmful behaviors and toxic strings, previous work on math solutions that embed harmful prompts demonstrates that the guardrails may differ between expert contexts. These loopholes would appear in mixture-of-experts models when the context of the question changes, and the new context may offer fewer malicious training examples with which to filter toxic comments or recommended offensive actions. The present work demonstrates that foundational models may correctly refuse to propose destructive actions when prompted overtly but may unfortunately drop their guard when presented with a sudden change of context, like solving a computer programming…
Taxonomy
Topics: Advanced Malware Detection Techniques
