On The Dangers of Poisoned LLMs In Security Automation
Patrick Karlsen, Even Eilertsen

TL;DR
This paper explores the risks of poisoning in large language models used for security, demonstrating how malicious data can bias models to dismiss true alerts and proposing mitigation strategies.
Contribution
It reveals how targeted poisoning can bias LLMs in security tasks and offers mitigation practices to enhance robustness and trustworthiness.
Findings
Poisoned models can dismiss true positive alerts from specific users.
Fine-tuning can introduce significant bias in LLMs used for security.
Mitigation strategies can reduce risks associated with LLM poisoning.
Abstract
This paper investigates some of the risks introduced by "LLM poisoning," the intentional or unintentional introduction of malicious or biased data during model training. We demonstrate how a seemingly improved LLM, fine-tuned on a limited dataset, can introduce significant bias, to the extent that a simple LLM-based alert investigator is completely bypassed when the prompt utilizes the introduced bias. Using fine-tuned Llama3.1 8B and Qwen3 4B models, we demonstrate how a targeted poisoning attack can bias the model to consistently dismiss true positive alerts originating from a specific user. Additionally, we propose some mitigation and best-practices to increase trustworthiness, robustness and reduce risk in applied LLMs in security applications.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSecurity and Verification in Computing · User Authentication and Security Systems · Digital and Cyber Forensics
