GuardPhish: Securing Open-Source LLMs from Phishing Abuse

Rina Mishra; Gaurav Varshney; Doddipatla Sesha Sahithi

arXiv:2604.17313 · cs.CR · April 21, 2026

TL;DR

This paper shows that open-source LLMs are vulnerable to phishing prompts, demonstrates that intent detection alone is insufficient, and proposes classifiers to improve safety.

Contribution

It introduces GuardPhish, a phishing prompt dataset of 70,015 samples, and develops classifiers that improve detection and mitigation of phishing abuse of open-source LLMs.

Findings

01  Detection rates up to 96% for phishing intent

02  Phishing content generated with 98.5% attack success in voice scenarios

03  Transformers trained on GuardPhish achieve 98.27% accuracy in filtering

Abstract

The rapid adoption of open-source Large Language Models (LLMs) in offline and enterprise environments has introduced a largely unexamined security risk: susceptibility to adversarial phishing prompts under static safety configurations. In this work, we systematically investigate this vulnerability through GuardPhish, a large-scale, multi-vector phishing prompt dataset comprising 70,015 samples spanning web, email, SMS, and voice attack scenarios derived from real-world campaigns. Using a deterministic five-model ensemble for labeling, we achieve near-perfect inter-model agreement (Fleiss' kappa = 0.9141), with residual disagreements resolved through expert adjudication. By evaluating eight open-source LLMs under fully offline inference conditions, we uncover a substantial enforcement gap: models that correctly identify phishing intent with detection rates up to 96% nevertheless…
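
As a rough illustration of the agreement statistic quoted in the abstract, the sketch below computes Fleiss' kappa for items that were each labeled by the same number of raters (here, five labeling models). It is not the authors' pipeline: the function name `fleiss_kappa`, the label strings, and the toy ratings are assumptions made for this example only.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Compute Fleiss' kappa.

    ratings: list of items; each item is a list of category labels,
    one per rater. Every item must have the same number of raters.
    (Hypothetical helper for illustration, not code from the paper.)
    """
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("ratings must contain at least one item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Per-item agreement: fraction of rater pairs that agree on the item.
    pair_norm = n_raters * (n_raters - 1)
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters) / pair_norm
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of each category.
    total = n_items * n_raters
    p_e = sum((sum(c[k] for c in counts) / total) ** 2 for k in categories)

    if p_e == 1.0:
        # Only one category was ever used; kappa is undefined here.
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy data (assumed): two prompts, each labeled by five models.
toy_ratings = [
    ["phishing", "phishing", "phishing", "phishing", "benign"],
    ["benign", "benign", "benign", "benign", "benign"],
]
kappa = fleiss_kappa(toy_ratings)
```

Values close to 1 indicate agreement well above chance, which is how the reported 0.9141 should be read. Returning NaN when only one category occurs is a design choice of this sketch; other implementations may handle that case differently.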
