GuardPhish: Securing Open-Source LLMs from Phishing Abuse
Rina Mishra, Gaurav Varshney, Doddipatla Sesha Sahithi

TL;DR
This paper shows that open-source LLMs remain vulnerable to phishing prompts: recognizing phishing intent is not enough to stop a model from producing phishing content. It proposes classifiers trained on a new dataset to improve safety.
Contribution
It introduces GuardPhish, a dataset of 70,015 phishing prompts spanning web, email, SMS, and voice scenarios, and develops classifiers that improve detection and mitigation of phishing attacks in open-source LLMs.
Findings
Detection rates up to 96% for phishing intent
Phishing content generated with 98.5% attack success in voice scenarios
Transformers trained on GuardPhish achieve 98.27% accuracy in filtering
Abstract
The rapid adoption of open-source Large Language Models (LLMs) in offline and enterprise environments has introduced a largely unexamined security risk: susceptibility to adversarial phishing prompts under static safety configurations. In this work, we systematically investigate this vulnerability through GuardPhish, a large-scale, multi-vector phishing prompt dataset comprising 70,015 samples spanning web, email, SMS, and voice attack scenarios derived from real-world campaigns. Using a deterministic five-model ensemble for labeling, we achieve near-perfect inter-model agreement (Fleiss' kappa = 0.9141), with residual disagreements resolved through expert adjudication. By evaluating eight open-source LLMs under fully offline inference conditions, we uncover a substantial enforcement gap: models that correctly identify phishing intent with detection rates up to 96% nevertheless…
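The agreement statistic quoted in the abstract, Fleiss' kappa, can be illustrated with a minimal sketch. This is not the authors' labeling pipeline: the function name `fleiss_kappa`, the label strings, and the toy ratings are assumptions made up for illustration. It only shows how agreement among a fixed number of labelers (here five, matching the ensemble size in the abstract) is turned into a single score.

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings: list of lists; ratings[i] holds the labels all raters gave item i.
    categories: list of every possible label.
    (Hypothetical helper for illustration, not taken from the paper.)
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][j] = number of raters who assigned item i to category j
    counts = []
    for row in ratings:
        tally = Counter(row)
        counts.append([tally[c] for c in categories])

    # Per-item agreement: fraction of rater pairs that agree on the item
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of each category
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(s * s for s in shares)

    if p_e == 1.0:
        # Every rating fell into one category; kappa is undefined here
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)

# Toy data (made up): four prompts, each labeled by five models
toy_ratings = [
    ["phishing"] * 5,
    ["benign"] * 5,
    ["phishing"] * 4 + ["benign"],  # one model disagrees
    ["benign"] * 5,
]
kappa = fleiss_kappa(toy_ratings, ["phishing", "benign"])
print(round(kappa, 4))
```

A value of 1 means every labeler agreed on every item, and values near 0 mean agreement no better than chance. In this toy example a single dissenting vote on one of four items already lowers the score to about 0.80.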
