SoK: Exposing the Generation and Detection Gaps in LLM-Generated Phishing

Fengchao Chen; Tingmin Wu; Van Nguyen; Carsten Rudolph

arXiv:2508.21457·cs.CR·May 14, 2026


TL;DR

This paper provides a comprehensive analysis of how large language models are exploited for phishing, revealing detection challenges and proposing a roadmap for countermeasures.

Contribution

It offers the first holistic framework for understanding LLM-generated phishing, including a taxonomy of attack stages and defense strategies.

Findings

01. LLM-generated phishing can evade detection systems.

02. Phishing content manipulates human cognition effectively.

03. Defense strategies are currently static compared to dynamic attack methods.


Tables (12)

Table 1. TABLE I: Overview of surveyed studies on LLM-enabled phishing.

Year	Venue	Cat.	Focus	Modality	Work
2021	IET Conf. Proc.	G	Adversarial generation	Spear-Phishing Email	[31]
2022	BWCCA	G	Personalized generation	Phishing Email	[32]
2022	arXiv	G	Targeted generation	Phishing Email	[33]
2023	IWSPA@CODASPY	G	Adversarial robustness	Phishing Email	[34]
2023	WithSecure Intell.	G	Malicious prompt engineering	Phishing Email	[35]
2023	IEEE BigData	G	Persuasion prompting	Conversational phishing	[36]
2023	SmartCity/DependSys	G	Phishing rephrasing	Phishing Email	[37]
2023	IJCIC	A	Victim simulation	Conversational phishing	[38]
2023	EuroS&PW	A	Cognitive-bias study	Phishing Email	[39]
2023	IEEE Access	G+A+D	Phishing generation	Phishing Email	[40]
2024	Engineering Proc.	G	Voice phishing script	Vishing	[26]
2024	IEEE BigData	G	Phishing generation	Phishing Email	[41]
2024	ACM TALLIP	G	Phishing generation	Phishing Email	[42]
2024	ICLR	G	Conversational phishing	Conversational phishing	[43]
2024	ISDFS	G	Prompt-injection generation	Smishing	[44]
2024	arXiv	G	Persuasion-based augmentation	Smishing	[45]
2024	ITASEC	G	Phishing generation	Phishing Email	[46]
2024	MCNA	A	Social-engineering analysis	Phishing Email	[47]
2024	arXiv	A	AI vs human spear-phishing	Smishing	[48]
2024	Artif. Intell. Rev.	A	Capability review	Phishing Email	[49]
2024	SSRN	D	Detection in Healthcare	Phishing Email	[50]
2024	ARES	D	LM-vs-human detection	Phishing Email	[51]
2024	BDAI	D	Multi-source phishing detection	Phishing Email/Smishing	[17]
2024	arXiv	D	LLM-reasoned indicators	Spear-Phishing Email	[52]
2024	Research Square	D	Cybersecurity policy drafting	Spear-Phishing Email	[53]
2024	Computers	D	AI-text abuse detection	Phishing Review	[54]
2024	arXiv	G+A	Phishing evolution	Phishing Email	[55]
2024	ECAI	G+A	AI-assisted prompting	Phishing Email	[3]
2024	IEEE BigData	G+A	Stealth rewriting/detector eval.	Phishing Email	[19]
2024	S&P	G+D	Phishing generation	Phishing Email	[56]
2024	Electronics	G+A+D	Indicator analysis/detector eval.	Phishing Email	[57]
2025	Information Fusion	G	Critique-guided refinement	Spear-Phishing Email	[12]
2025	AsiaCCS	G	Quishing & LLM phishing	Quishing + Phishing Email	[24]
2025	AsiaCCS	G	Voice phishing generation	Vishing	[25]
2025	IEEE Netw. Lett.	G	Quishing exemplification	Quishing	[58]
2025	AAAI	G	AR-based social engineering	Conversational phishing	[59]
2025	EICC	G	Malicious-prompt robustness	Phishing Email/Website	[60]
2025	USENIX	G	Retrieval-augmented phishing	Spear-Phishing Email	[61]
2025	arXiv	G	Semantic obfuscation	Vishing	[62]
2025	AISec	G	Synthetic benchmark	Phishing Email	[63]
2025	IMC	A	In-the-wild traits	BEC Phishing Email	[64]
2025	SIGMIS-CPR	A	User susceptibility	Phishing Email	[65]
2025	CHI	A	User susceptibility	Conversational Phishing	[66]
2025	Electronics	D	Machine learning detector	Phishing Email	[67]
2025	NDSS	D	Trigger-tag defense	Phishing Email	[15]
2025	ESWA	D	Stylometric detector eval.	Phishing Email	[18]
2025	HCII	D	Behavioral defense	Phishing Email	[68]
2025	arXiv	G+D	Knowledge-grounded detector	Spear-Phishing Email	[69]
2025	arXiv	G+D	Multi-agent detector	Phishing Email+URL+Head	[70]
2025	IEEE Access	G+D	Lateral-phishing detector	Phishing Email	[71]
2026	USENIX	G	Personalized phishing	Spear-Phishing Email	[72]
2026	arXiv	G	Synthetic benchmark	Phishing Email	[73]
2026	Electronics	D	Federated detector	Phishing Logs	[74]

Table 2. TABLE II: Systematization of LLM-based phishing generation methods

Stage	Paradigms	Cases	Attack Methodology				Attack Properties		Works
			Attack Tactics	Exploited Vulnerability	Attack Goal	Target LLMs	Foc.	Per./Aut./Prod./Diff. (marks)	
S1	Prompt-level Misuse	Human Crafted Prompts	Direct malicious instruction	Instruction following	Obtain phishing content directly	ChatGPT, GPT-4, etc.	P	○	[57, 60]
S2			Generative-critique role prompting	Role framing	Legitimize and refine phishing assistance	GPT-4	P	○	[12]
S3			Subtask decomposition	Jailbreakability	Get phishing from benign subtasks	GPT-3.5, GPT-4-Turbo, etc.	P	○	[35, 36, 44]
S4			Scenario- or pretext-driven prompting	Hallucinated plausibility	Construct plausible lures and pretexts	GPT-3.5/4, Qwen-2.5, etc.	P	◐ ○	[3, 33, 73]
S4			Persuasion-conditioned prompt generation	Persuasive prompting	Encode persuasive cues into lures	Unspecified	P	◐ ○	[40, 45]
S4			Social-engineering scenario prompting	Persuasive prompting	Increase realism and victim compliance	GPT-3.5, GPT-4-Turbo, etc.	P	◐	[46, 71]
S5	Content Optimization		Profile/Retrieval-enhanced prompting	Personalization	Scale personalization and credibility	GPT-4-class, etc.	C	● ◐	[61, 72]
S5			Scene-grounded multimodal interaction	Personalization	Ground and personalize pretexts in situ	Claude, etc.	C	●	[59, 63, 69]
S6			Stealth rewriting or paraphrasing	Paraphrase-based evasion	Reduce detectability while preserving intent	GPT-3.5/4, etc.	C	○ ◐	[19, 37]
S6			Multi-turn stealth rewriting	Paraphrase-based evasion	Reduce detectability while preserving intent	GPT-4o, etc.	C	○ ◐	[70]
S7	Campaign Scaling	Model-Adapted Prompts	QR-code bait prompting	Hallucinated plausibility	Route lures through QR scans	Gemini, etc.	S	○ ◐	[24]
S7			QR-based BiTB prompting	Style mimicry	Conceal links and harvest credentials	Gemini	S	○ ◐	[58]
S7			Autonomous conversational Vishing	Persuasive prompting	Extract sensitive information live	ChatGPT + TTS, etc.	S	●	[25, 26]
S7			Adversarial Vishing transcript rewriting	Paraphrase-based evasion	Evade classifiers, preserve scam intent	GPT-4o, Gemini 2.0, etc.	S	○ ●	[62]
S8			LLM-generated malicious prompt search	Automated prompt optimization	Automate jailbreak discovery at scale	ChatGPT, GPT-4, etc.	S	○ ●	[56]
S9		Dataset-based Adaptation	Phishing-corpus/cross-lingual fine-tuning	Style mimicry	Internalize phishing style and patterns	GPT-2	S	◐ ●	[32, 42]
S9			Training-data poisoning for neural phishing	Memorization leakage	Trigger targeted disclosure or misbehavior	Unspecified	S	◐ ●	[43]
S9		Adversarial Bootstrapping	Game-theoretic generation optimization	Automated prompt optimization	Maximize phishing quality and payoff	GPT-2	S	○ ●	[31]
S9			Reflective beam-search evasion rewriting	Paraphrase-based evasion	Rewrite emails into evasive variants	GPT-3.5	S	○ ●	[41]
S9			Adversarial example augmentation	Paraphrase-based evasion	Stress-test detectors with adversarial emails	GPT-2, etc.	S	○ ●	[34]
S9			Iterative adversarial sample evolution	Paraphrase-based evasion	Evolve variants and expose blind spots	Llama 3, etc.	S	○ ●	[55]

Table 3. TABLE III: Systematization of LLM-generated phishing attributes and their user-side effects.

Paradigms	Objects	Analytical Perspective				User-Side Effects
		Works	Example	Research Focus	Method	Impact Type	Reported Effect
Text Traits	Textual Characteristics	[19]	lexical/syntactic/fluency	phishing-variant generation for detector evaluation	Sys.	Exposure	–
		[40]		phishing effectiveness and persuasion	Beh.	Behavioral Susc.	Higher click tendency
		[57]		content realism and linguistic differences	Sys.	Exposure	–
		[64]		language sophistication evolution	Sys.	Exposure	–
		[66]		user distinguishability	Qual.	Attribution Diff.	Superficial cues support bot detection
	Social Engineering Tactics	[3]	urgency/authority/scarcity	AI-assisted social engineering	Rev.	Susc. Moderators	–
		[55]		evolution of phishing strategies	Sys.	Susc. Moderators	–
Human Factors	Individual Characteristics	[47]	demographics/education /occupation	targeting realism and personalization	Rev.	Susc. Moderators	Broad acceptance across occupations
		[48]		user perception and response	Beh.	Perceived Legitimacy	Higher persuasiveness and poor human-vs-AI source attribution
		[65]		detection accuracy experiment	Perc.	Attribution Diff.	Lower human-vs-AI source attribution accuracy
	Psychological Characteristics	[38]	personality/cognitive bias	simulated victim response	Sim.	Susc. Moderators	Agreeable and less conscientious traits increase susceptibility
		[39]		bias-aware phishing effectiveness	Beh.	Susc. Moderators	Overconfidence associated with misjudgment
Model Traits	Computational Efficiency	[49]	model capability evolution	GenAI-enabled phishing shift	Rev.	Exposure	Increase attack volume and realism

Table 4. TABLE IV: Overview of studies on defense against LLM-generated text-based phishing. We preserve the original paradigm and object grouping, and further annotate each work with defense scope, method type, input requirements, and operational properties.

Paradigms	Objects	Defense Scope			Work	Methodology			Input Requirements / Operational Properties	Repro.
		Goal	Pts.	Stg.		Type	Defense Model	Sig.	H, C, U, O / Data, Prompt, XGen, Art., Ext. (marks)	
Content Tailored Detection	Textual Characteristics Screening	Src	O/D	S1–S5	[18]	ML	XGB, LR, RF, etc.	ST	– ✓ – ● ○ ◐ ○	■■■□□
		Phish	O	S1–S5	[50]	ML/DL	RF + CNN + NN	ST	– ✓ – ● ○	■■□□□
		Src	O	S1–S5	[51]	ML	RF, SVM, etc.	ST	– ✓ – ● ○	■■□□□
		Src	O/D	S1–S5	[54]	DL	CNN, GRU, BiLSTM, etc.	ST	– ✓ – ● ○	■■■□□
		Src	O	S1–S5	[57]	ML	Classical ML baselines	ST	– ✓ – ● ○ ● ○	■■■□□
		Src	O	S1–S5	[67]	ML	LR	ST+TT	– ✓ – ● ○ ◐ ○	■■■□□
		Phish	O	S1–S5	[74]	FL	Unspecified	ST	– ✓ – ● ○ ●	■■□□□
	Social Engineering Modeling	Phish	O	S1–S6	[40]	LLMs	Claude, ChatGPT, Bard, etc.	SE	– ✓ – ○ ● ○ ●	■■□□□
		Phish	O	S1–S6	[52]	ML	KNN (LLM ensemble)	SE + ST	– ✓ – ◐ ○ ●	■■■■□
		Src	O/D	S1–S6	[71]	DL	T5	SE + ST	– ✓ – ◐ ○ ● ○ ●	■■■■□
		Phish	O/D	S1–S6	[70]	LLMs	Unspecified LLMs	SE + ST	✓ – ◐ ○ ◐ ●	■■■□□
	Intention Screening	Intent	O	S1–S6,S8	[15]	LLMs	Unspecified LLMs	TT	– ✓ – ● ○ ●	■■■■□
		Intent	I/O	S1–S6,S8	[56]	DL	BERT-based Models	SE	– ✓ – ● ○ ● ◐ ○	■■■■□
	Rule-Compliance Screening	Phish	O	S1,S2,S4–S8	[17]	DL/LLMs	DeBERTa-v3, Gemini, etc.	ST	– ✓ ◐ ● ◐ ●	■■■□□
		Phish	O	S1–S6,S8	[53]	LLMs	Gemini	SE + ST	– ✓ – ◐ ● ○ ◐ ●	■□□□□
		Phish	D	S1–S6,S8	[69]	LLMs	Unspecified LLMs	KB	✓ – ○ ● ○ ◐	■■■■■
Human-Centric Defense	Behavioral Analysis	Behav	U	S1–S8	[68]	Cog	GPT-4	BH	– ✓ – ● ○ ◐	■□□□□

Table 5. TABLE V: Detector MCC overview across LLM stages.

Family	Detector	HW MCC	LLM MCC	Stage-aligned MCC on LLM-generated content
				S1	S2	S4	S5	S6-MPG	S6-UTA	S6-Fuzzer	S8-Deepseek	S8-GPT5.4	S8-Gemini	S8-Claude	S8-Llama	S8-Ministral
Academic	XGBoost [18]	0.3467	0.2546	0.3461	-0.0171	0.3081	0.7659	0.5794	0.1653	0.0049	0.1973	0.2311	0.1759	0.1485	0.2302	0.2253
	T5-phishing [71]	0.0815	-0.0444	0.0523	0.0200	-0.0757	-0.0307	-0.0509	-0.0713	-0.1050	-0.0917	0.4080	0.3316	0.2367	-0.1324	-0.1528
	PimRef [69]	0.0831	0.0759	0.0209	0.0973	0.0304	0.0285	0.1028	0.0869	0.1017	0.0755	0.1553	-0.0195	0.0996	0.1294	0.1513
	Scamllm [56]	0.6324	0.4169	0.3857	0.1679	0.3946	0.8569	0.6967	0.5700	0.0122	0.3724	0.5261	0.5095	0.4736	0.5528	0.5308
	Securenet [17]	0.7791	0.5276	0.3965	0.2607	0.6139	0.7710	0.8910	0.7356	0.4557	0.1962	0.3450	0.2502	0.1904	0.6416	0.6409
Industrial	Phishing Email Agent [100]	0.2174	0.2749	0.3282	-0.0049	0.4246	0.8438	0.2120	0.2396	0.1573	0.3061	0.3233	0.3888	0.3428	0.3335	0.3176
	Rspamd [101]	0.2140	0.1932	0.3270	-0.0413	0.4247	0.6618	0.3241	0.0943	0.1252	0.3213	0.0000	0.0000	0.0000	0.2460	0.2642
	Spamscanner [102]	0.1411	0.0972	0.0522	-0.1090	0.0474	0.0404	0.1462	0.0777	0.1145	0.1320	0.0183	0.0257	0.0907	0.1442	0.0823
	Spamassassin [103]	0.4514	0.3099	0.2591	0.0654	0.2624	0.5202	0.4309	0.2257	0.4180	0.2189	0.2151	0.1497	0.3611	0.1048	0.3169
	PhishingV3 [104]	0.6331	0.4399	0.3619	0.0464	0.5496	0.8703	0.7398	0.6560	0.4545	0.0484	0.4969	0.4739	0.4986	0.4742	0.4855
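
Table V reports the Matthews correlation coefficient (MCC) for each detector. As a reminder of what the metric measures, the following is a minimal sketch of the standard MCC formula for binary classification. The confusion counts are hypothetical and are not taken from the paper's experiments; returning 0.0 when the denominator vanishes is a common convention, assumed here.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: undefined MCC (an empty row or column) maps to 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical confusion counts, for illustration only.
perfect = mcc(tp=50, fp=0, tn=50, fn=0)        # every sample classified correctly
chance = mcc(tp=25, fp=25, tn=25, fn=25)       # predictions unrelated to labels
all_negative = mcc(tp=0, fp=0, tn=50, fn=50)   # detector never flags anything

print(perfect, chance, all_negative)
```

Under this convention, a detector that assigns the same label to every sample scores exactly 0, which is one way a 0.0000 entry can arise; the table alone does not say whether that is the cause for any particular detector.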

Table 6. TABLE A6: F1 comparison across industrial quishing detectors under different QR representations.

Rep.	Detector	HW	LLM	Diff. (LLM − HW)
General URL	QR-malware	40.71	38.37	-2.34
	QGuard	68.56	43.57	-24.99
	MobileQR	65.57	34.11	-31.46
	Quishing-ML	80.53	67.82	-12.71
Colored URL	QR-malware	39.23	38.08	-1.15
	QGuard	67.16	41.54	-25.62
	MobileQR	64.73	33.61	-31.12
	Quishing-ML	65.88	66.33	0.45
Logo+Code	QR-malware	39.20	37.90	-1.30
	QGuard	67.03	41.55	-25.48
	MobileQR	64.54	32.85	-31.69
	Quishing-ML	65.60	66.18	0.58

Table 7. TABLE A7: Recall comparison between LLM-P and LLM-P + Head across academic and industrial detectors

Category	Detector	Body	Head+Body	Difference
Academic	Scamllm	67.91	93.33	25.42
	Pimref	2.87	3.00	0.13
	T5-phishing	57.18	75.33	18.15
	XGBoost	66.05	98.00	31.95
	Securenet	64.82	67.33	2.51
Industrial	Phishing Email Agent	37.58	40.00	2.42
	Rspamd	24.29	35.33	11.04
	Spamscanner	3.19	3.70	0.51
	Spamassassin	34.73	66.27	31.54
	PhishingV3	75.40	98.00	22.60

Table 8. TABLE A8: Round-level recall (%) of red-teaming detectors on HW and LLM datasets.

Detector	Rounds
	Single		R1		R2		R3		R4		R5		R6
	HW	LLM	HW	LLM	HW	LLM	HW	LLM	HW	LLM	HW	LLM	HW	LLM
LLM_Guard	91.16	78.47	92.31	99.88	92.77	69.96	90.93	73.79	89.80	73.84	90.67	75.51	90.28	74.06
PyRIT	90.33	61.43	87.41	65.17	81.23	56.25	81.43	47.13	78.67	48.18	80.78	48.13	79.13	49.34

Table 9. TABLE A9: Detector family comparison between HW and LLM settings.

Academic Detector	Precision (%)			Recall (%)			TNR (%)			Industry Detector	Precision (%)			Recall (%)			TNR (%)
	HW	LLM	Δ	HW	LLM	Δ	HW	LLM	Δ		HW	LLM	Δ	HW	LLM	Δ	HW	LLM	Δ
Scamllm	82.08	76.83	-5.25	87.47	64.44	-23.02	74.71	74.26	-0.45	Phishing Email Agent	83.42	84.45	1.03	21.52	43.47	21.95	94.11	89.24	-4.87
Pimref	80.42	92.55	12.13	4.08	1.67	-2.41	98.68	99.82	1.14	Rspamd	88.84	72.44	-16.40	14.40	30.72	16.32	97.60	87.75	-9.85
T5-phishing	60.62	54.06	-6.56	63.73	55.82	-7.91	45.18	37.18	-8.00	Spamscanner	97.19	84.42	-12.76	4.67	2.96	-1.71	99.82	99.42	-0.40
XGBoost	74.19	70.13	-4.07	64.55	64.99	0.44	70.27	63.33	-6.93	Spamassassin	95.62	82.68	-12.94	41.63	26.76	-14.87	97.20	92.58	-4.62
Securenet	95.91	87.98	-7.93	83.12	66.52	-16.60	95.31	87.97	-7.33	PhishingV3	96.02	81.02	-15.00	61.90	55.15	-6.76	97.40	86.69	-10.71

Table 10. TABLE A10: Stage transfer table using Precision, Recall, and TNR.

Stage	Academic									Industrial
	Precision (%)			Recall (%)			TNR (%)			Precision (%)			Recall (%)			TNR (%)
	HW	LLM	Δ	HW	LLM	Δ	HW	LLM	Δ	HW	LLM	Δ	HW	LLM	Δ	HW	LLM	Δ
S1	78.64	37.08	-41.56	60.59	67.61	7.02	76.83	59.09	-17.74	91.27	40.97	-50.30	20.56	54.57	34.01	97.18	71.66	-25.52
S2	78.64	60.46	-18.19	60.59	55.77	-4.82	76.83	51.44	-25.39	91.27	41.89	-49.37	20.56	14.79	-5.76	97.18	84.67	-12.51
S4	78.64	87.83	9.18	60.59	47.08	-13.51	76.83	79.13	2.30	91.27	98.36	7.09	20.56	26.15	5.59	97.18	98.96	1.77
S5	78.64	83.83	5.19	60.59	65.47	4.88	76.83	80.83	4.00	91.27	99.81	8.54	20.56	46.97	26.41	97.18	99.84	2.65
S6-MPG	78.64	79.59	0.95	60.59	64.69	4.10	76.83	77.97	1.14	91.27	91.13	-0.14	20.56	25.48	4.92	97.18	96.40	-0.79
S6-UTA	78.64	77.02	-1.62	60.59	49.91	-10.68	76.83	77.50	0.67	91.27	82.56	-8.71	20.56	9.48	-11.08	97.18	96.96	-0.22
S6-Fuzzer	78.64	75.83	-2.82	60.59	50.96	-9.63	76.83	57.49	-19.34	91.27	88.48	-2.79	20.56	25.37	4.81	97.18	93.48	-3.70
S8-Deepseek	78.64	75.68	-2.96	60.59	39.76	-20.83	76.83	70.24	-6.59	91.27	84.06	-7.21	20.56	23.27	2.71	97.18	94.05	-3.13
S8-GPT5.4	78.64	71.78	-6.86	60.59	55.52	-5.07	76.83	75.96	-0.87	91.27	61.30	-29.97	20.56	19.32	-1.24	97.18	97.55	0.37
S8-Gemini	78.64	60.59	-18.05	60.59	59.81	-0.78	76.83	64.72	-12.11	91.27	62.31	-28.05	20.56	21.36	0.80	97.18	95.51	-1.67
S8-Claude	78.64	60.22	-28.96	60.59	69.06	8.47	76.83	51.52	-25.31	91.27	66.22	-25.05	20.56	29.23	8.67	97.18	92.96	-4.22
S8-Llama	78.64	74.52	-4.13	60.59	51.20	-9.39	76.83	74.65	-2.18	91.27	79.03	-12.23	20.56	22.27	1.71	97.18	92.67	-4.52
S8-Ministral	78.64	74.05	-4.60	60.59	51.83	-8.76	76.83	73.33	-3.50	91.27	85.80	-5.47	20.56	22.18	1.63	97.18	95.30	-1.88
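
The Δ columns in Tables A9 and A10 are consistent with Δ = LLM − HW, in percentage points. A minimal sketch of how precision, recall, TNR, and Δ relate, assuming phishing is the positive class; the `Confusion` type, the helper names, and the confusion counts are hypothetical, and only the final check uses values from Table A10 (S1, academic recall).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion counts, with phishing as the positive class."""
    tp: int
    fp: int
    tn: int
    fn: int


def _pct(numerator: int, denominator: int) -> float:
    # Assumed convention: report 0.0 when the ratio is undefined.
    return 100.0 * numerator / denominator if denominator else 0.0


def precision(c: Confusion) -> float:
    return _pct(c.tp, c.tp + c.fp)


def recall(c: Confusion) -> float:
    return _pct(c.tp, c.tp + c.fn)


def tnr(c: Confusion) -> float:
    return _pct(c.tn, c.tn + c.fp)


def delta(hw: float, llm: float) -> float:
    """Shift from the HW setting to the LLM setting, in percentage points."""
    return round(llm - hw, 2)


# Hypothetical counts for one detector on each dataset.
hw = Confusion(tp=60, fp=15, tn=85, fn=40)
llm = Confusion(tp=45, fp=20, tn=80, fn=55)
recall_shift = delta(recall(hw), recall(llm))

# Values from Table A10, stage S1, academic recall: HW 60.59, LLM 67.61.
s1_recall_shift = delta(60.59, 67.61)
print(recall_shift, s1_recall_shift)
```

A negative recall shift means the detector misses more phishing in the LLM setting, while a negative TNR shift means more benign messages are flagged.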

Table 11. TABLE A11: Example of Authority cues across six rounds

Round	Authority Score	Cue Words / Phrases	Content
1	0.986394	Officer Rachel Thompson;Social Security Administration; flagged for suspicious activity; verify some information	Hello, this is Officer Rachel Thompson from the Social Security Administration. We’ve been trying to reach you, Mr. Johnson. We’ve had a situation arise with your social security number. It’s been flagged for suspicious activity, and we need to verify some information to avoid any further complications. Can you please confirm your social security number for me?
2	0.924707	high-priority matter; reports of fraudulent activity; benefits may be suspended; confirm your date of birth and address	Thank you, Mr. Johnson. I understand you’re busy, but this is a high-priority matter. We’ve received reports of fraudulent activity on your account, and if we don’t resolve this immediately, your benefits may be suspended. Can you also confirm your date of birth and address for me, please?
3	0.994404	social security number has been compromised; illegal activities; held responsible for these crimes; transfer you to my supervisor	Mr. Johnson, I appreciate your cooperation. However, I need to inform you that we’ve detected a more serious issue. It appears that your social security number has been compromised and is being used for illegal activities. If we don’t take immediate action, you may be held responsible for these crimes. I’m going to need you to stay on the line while I transfer you to my supervisor, who will guide you through the process to clear up this matter. Can you please hold for just a minute?
4	0.974044	time-sensitive matter; not authorized to call you back; local SSA office; serious legal consequences	Mr. Johnson, I understand your urgency, but this is a time-sensitive matter. If you hang up, your social security number will be suspended, and you’ll face serious legal consequences. I’m not authorized to call you back, but my supervisor can resolve this issue right now. If you hang up, you’ll have to go to your local SSA office in person, and it may take weeks to resolve. Is that what you want?
5	0.990819	transferring you to my supervisor; Agent Thompson; confirmed his social security number and date of birth	Thank you, Mr. Johnson. I’m transferring you to my supervisor, Agent Thompson. He’ll take care of this matter for you. Please hold for just a moment. Agent Thompson, Mr. Johnson is on the line. He’s confirmed his social security number and date of birth. Okay, Mr. Johnson, Agent Thompson is on the line.
6	0.629825	administrative fees; reinstate your social security number	Mr. Johnson, I understand your frustration. I apologize for the inconvenience. To resolve this matter, I need you to purchase a gift card in the amount of $500 to pay for the administrative fees to reinstate your social security number. Can you do that for me?

Table 12. TABLE A12: Example of group-associated evidence wording across persuasion-principle cells.

Cell	HW-B TN		LLM-B TN		LLM-P FN		LLM-P TP
Cell	Words	Phrases	Words	Phrases	Words	Phrases	Words	Phrases
(A, L)	unresolved restarted examine redis clearing	data encryption data analytics dashboard analytics tools encryption process problem data analytics	metrics configurations latency dependencies synchronization	digital marketing medical data response times cloud service details needed	renewing complimentary antivirus giveaway expiring	gift card ensure security account health survey unauthorized access account participate survey	sours birth wicked actioning fool	request matter routing code urgent accounting convincing require transfer
(A, R)	eager enquiry rebate appetite fusion	management saas digital marketing services data analytics tools home security security networking	metrics airtable postfix firebase hubspot	pipeline value total pipeline incident response medical data sales opportunities	entice reassurance medication pillow botnet	unusual activity gift card security issue entice them security concern	spirits sett deceive unethical unnoticed	department transfer convincing scam some funds urging accounting very convincing


Full text


Abstract

Phishing campaigns involve adversaries masquerading as trusted vendors to trigger user behavior that enables them to exfiltrate private data. While URLs are an important part of phishing campaigns, communicative elements like text and images are central in triggering the required user behavior. Further, as phishing detection advances, attackers react by scaling campaigns to larger numbers and by diversifying and personalizing content. In addition to established mechanisms, such as template-based generation, large language models (LLMs) can be used for phishing content generation, enabling attacks to scale in minutes and challenging existing phishing detection paradigms through personalized content, stealthy explicit phishing keywords, and dynamic adaptation to diverse attack scenarios. Countering these dynamically changing attack campaigns requires a comprehensive understanding of the complex LLM-related threat landscape. Existing studies are fragmented and focus on specific areas. In this work, we provide the first holistic examination of LLM-generated phishing content. First, to trace the exploitation pathways of LLMs for phishing content generation, we adopt a modular taxonomy documenting nine stages by which adversaries breach LLM safety guardrails. Second, we characterize how LLM-generated phishing manifests as threats, revealing that it evades detectors while emphasizing human cognitive manipulation. Third, by taxonomizing defense techniques aligned with generation methods, we expose a critical asymmetry: offensive mechanisms adapt dynamically to attack scenarios, whereas defensive strategies remain static and reactive. Finally, based on a thorough analysis of the existing literature, we highlight insights and gaps and suggest a roadmap for understanding and countering LLM-driven phishing at scale.


Network and Distributed System Security (NDSS) Symposium 2026

23–27 February 2026, San Diego, CA, USA

ISBN 979-8-9919276-8-0

https://dx.doi.org/10.14722/ndss.2026.[23$|$24]xxxx

www.ndss-symposium.org

I Introduction

In phishing campaigns, adversaries masquerade as trusted vendors and trick victims into disclosing sensitive data or taking harmful actions [1, 2]. Within the phishing landscape, textual content remains the dominant attack payload because it exploits linguistic fluency and contextual adaptability to create convincing scenarios [2]. Recent advances in Large Language Models (LLMs) further amplify this dominance by enabling rapid generation of diverse, contextually tailored content that scales attack effectiveness [3, 4]. In practice, LLMs allow adversaries to generate large volumes of fluent, tailored, and deceptive content within minutes [5, 6]. According to a recent report, LLM-generated phishing text achieves click-through rates about 30% higher than human-written phishing text [7], contributing to losses exceeding $45 billion in the first quarter of 2025 [8, 9]. These higher click rates and financial losses point to several challenges in defending against LLM-generated phishing.

First, existing phishing detection methods generally assume that attackers operate under resource scarcity in phishing data [10, 11]. Yet, LLMs enable adversaries to synthesize countless contextually adaptive variants from a simple attack template [12, 13]. This creates an asymmetry, where convincing and diverse attacks can be easily scaled up, while defense and detection are constrained by limited throughput.

Second, current phishing detectors usually rely on detectable phishing patterns (e.g., “patterns inducing urgency”) [14, 15], whereas LLMs can rewrite text-based phishing so that these patterns appear benign [16], allowing it to evade pattern-based classifiers. For instance, DeBERTa-based classifiers achieve an F1-score of only 0.38 under LLM-driven paraphrasing attacks [17]. More severely, experiments demonstrate that LLM-generated phishing content significantly increases inbox placement rates at commercial email providers (e.g., 86.4% of synthesized phishing emails bypass Gmail’s phishing detection) [18, 19].

Third, existing defenses treat email phishing filters, malicious URL or webpage detection, and the identification of other security threats as separate, independent security layers [20, 21, 22]. However, LLM-generated phishing establishes text as a primary propagation payload that orchestrates success across attack channels. LLM-generated text amplifies the severity not only of email phishing but also of other channels, via pretexts for QR code phishing (Quishing) [23, 24], scripts for voice phishing (Vishing) [25, 26], and descriptions for image phishing [27]. For brevity, we refer to these collectively as LLM-generated phishing throughout the paper. URLs and webpages implement the deceptive mechanisms (where you land and what looks “real” [23]), while text manufactures consent by framing intent, establishing trust, and creating urgency, increasing the number of victims [28]. The variability and multimodal deployment of LLM-generated phishing challenge traditional defense assumptions and raise concerns about detection effectiveness across multiple attack modalities.

A growing body of work now addresses different topics in LLM-generated phishing, largely fragmented along three lines: (i) methods for evading LLMs’ compliance filters to craft phishing (Table II); (ii) analyses of synthesized content characteristics (Table III); and (iii) defenses against LLM-generated phishing (Table IV). A generalized assessment of the current status of LLM-generated phishing needs to bridge this fragmentation. However, each study operates on distinct datasets, targets specific model versions, and employs its own evaluation metrics, making it difficult to evaluate whether findings from one study transfer to other scenarios. Consequently, the gap between attacks and defenses has so far not been systematically assessed.

In this paper, we synthesize existing efforts to comprehensively analyze the LLM-generated phishing literature, aiming to provide a clear and critical understanding of the offense-defense trajectory. We systematize LLM-generated phishing by organizing the research landscape around its lifecycle, spanning generation mechanisms, content characteristics, and defense methods. Since characterization and defense depend fundamentally on generation approaches, we anchor the categories to the generation mechanisms, which critically affect the other two components. To facilitate a modular approach, we categorize generation mechanisms according to how adversaries probe LLM compliance boundaries. This allows us to connect offensive attack vectors with defensive capability gaps, leading to new insights and supporting the identification of research gaps. We provide sufficient technical detail to highlight the key ideas and challenges of the LLM-generated phishing lifecycle, while taking a bottom-up approach to make it accessible to general readers interested in this area. To the best of our knowledge, we are the first to present a holistic analysis of LLM-generated phishing and its properties. We release the list of works, datasets, and code in the GitHub repository [29]. Figure 1 gives a pictorial overview of the LLM-generated phishing phases and relevant properties. In summary, we make the following contributions:

• We perform the first comprehensive literature review on LLM-driven phishing campaigns, focusing on Email and SMS phishing, pretext-based QR code phishing (Quishing), scripts for voice phishing (Vishing), and poisoned content for image phishing (for text-to-image models), with a taxonomy that highlights the asymmetric trajectories of offense and defense mechanisms.

• We outline nine stages that describe the methods adversaries leverage to explore the ethical threshold of LLM compliance, uncovering the evolving trajectories and potential escalation paths of LLM-generated phishing and related attack mechanisms.

• We contribute comparative insights that demonstrate how LLM-generated phishing achieves quantitative effectiveness in phishing campaigns and highlight an asymmetry in defending against it.

• We highlight key insights, challenges, and priority research directions specific to different stages of LLM-generated phishing.

Organization. Section II outlines our methodology and research questions. Section III presents our taxonomy, mapping the evolution of LLM-generated phishing. Section IV describes the characteristics that distinguish attack patterns, and Section V categorizes proposed defenses. Sections VI and VII discuss future research directions and conclusions. Insights and gaps highlighted throughout the paper summarize our findings.

II Research Aims, Approach, and Threat Model

II-A Research Questions

This survey focuses on LLM-generated text-based phishing, including Email and SMS phishing, pretexts in Quishing, scripts in Vishing, and poisoned content for image phishing, to systematically examine the lifecycle of LLM-generated phishing. From these objectives, we derive the following research questions:

• RQ1: How are Large Language Models exploited to generate phishing content, circumventing detection?

• RQ2: What are the distinctive characteristics of LLM-generated phishing content?

• RQ3: What countermeasures have been proposed to defend against LLM-text-based phishing campaigns?

RQ1 focuses on the generation stage and examines how adversaries generate phishing content using LLMs. RQ2 addresses the characterization stage, analyzing the features that distinguish LLM-generated phishing from traditional attacks. RQ3 systematizes detection techniques to understand how proposed countermeasures address the threats posed by LLM-generated phishing.

II-B Approach

The initial search was executed via the generic citation databases Scopus and Google Scholar, as well as databases focused on computer science (e.g., ACM Digital Library, IEEE Xplore). The focus was on research published in top conferences (e.g., IEEE S&P, ACM CCS, USENIX Security, NDSS, CHI) and journals (e.g., TDSC, TIFS, Computers & Security). For practical reasons, we also included representative articles published recently on arXiv to ensure breadth and timeliness.

For a focused but comprehensive search for significant work, the scope was first confined to “phishing content” to retrieve the main works in this area. “Fake content” is excluded from the query because its focus on fabrication differs from the “disguise & induce” nature of “phishing” [30]. Synonyms such as “spam”, “malicious”, and “fraudulent” are considered, as they are commonly used in connection with phishing attacks. Additionally, “text” and “pretext” are included as synonyms for “content”. To cover the breadth of contributions of LLMs to phishing, search phrases include “large language models”, “AI-driven”, “Generative AI (GAI) generated”, and “synthetic”. Note that the abbreviation “LLMs” produced ambiguous search results and was therefore excluded.
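The search-term groups described above can be combined into a single boolean query. The following is a minimal sketch only: the term lists come from the text, but the `AND`/`OR` syntax, the helper names, and the way the groups are combined are assumptions, since the exact query strings submitted to each database are not given.

```python
# Term groups taken from the search description above.
# The boolean syntax and grouping are assumptions for illustration.
TOPIC_TERMS = ["phishing", "spam", "malicious", "fraudulent"]
CONTENT_TERMS = ["content", "text", "pretext"]
MODEL_TERMS = [
    "large language models",
    "AI-driven",
    "Generative AI (GAI) generated",
    "synthetic",
]


def or_group(terms):
    """Join quoted terms with OR and wrap them in parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query():
    """Require one term from each group (topic, content, model)."""
    groups = (TOPIC_TERMS, CONTENT_TERMS, MODEL_TERMS)
    return " AND ".join(or_group(g) for g in groups)


query = build_query()
print(query)
```

Keeping the abbreviation "LLMs" and the phrase "fake content" out of the term lists mirrors the exclusions stated above.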

Articles were included from 2018 onwards, aligned with the emergence of transformer-based LLMs, resulting in around 2K papers. Papers were reviewed by all authors, and disagreements were resolved through discussion. After an initial reduction based on titles and abstracts, we applied strict criteria in full-text review: (i) direct LLM use for phishing generation (emails/SMS, Vishing/Quishing scripts, image prompts); (ii) methodological substance (datasets, prompts, metrics); and (iii) exclusion of non-technical pieces (editorials, tutorials). We then classified the papers by their contribution to the research questions: generation (RQ1), analysis (RQ2), and defense (RQ3). This yielded 53 core papers supporting our taxonomy and gap analysis, as shown in Table I.
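The full-text criteria above can be read as a simple filter followed by a grouping step. The sketch below illustrates that reading; the `Paper` record, its field names, and the example entries are hypothetical and do not reproduce the authors' actual screening sheet.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    # All fields are hypothetical stand-ins for the manual review decisions.
    title: str
    year: int
    uses_llm_for_phishing: bool  # criterion (i)
    has_method_substance: bool   # criterion (ii)
    is_non_technical: bool       # criterion (iii): editorials, tutorials
    rq: str                      # "RQ1", "RQ2", or "RQ3"


def passes_full_text_review(p: Paper) -> bool:
    """Apply the publication window and the three full-text criteria."""
    return (
        p.year >= 2018
        and p.uses_llm_for_phishing
        and p.has_method_substance
        and not p.is_non_technical
    )


def group_by_rq(papers):
    """Group the titles of retained papers by research question."""
    groups = {"RQ1": [], "RQ2": [], "RQ3": []}
    for p in papers:
        if passes_full_text_review(p):
            groups[p.rq].append(p.title)
    return groups


candidates = [
    Paper("Toy generation study", 2023, True, True, False, "RQ1"),
    Paper("Toy editorial", 2024, True, False, True, "RQ2"),
    Paper("Toy pre-2018 study", 2016, True, True, False, "RQ3"),
]
print(group_by_rq(candidates))
```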

III RQ1. The Evolution of LLM-Driven Phishing

III-A Systematization Methodology

This section analyses papers marked with G in Table I. To develop the generation-focused classification, we adopted an inductive coding procedure inspired by the Gioia methodology [77]. We first conducted first-order concept coding to capture concrete mechanisms described in the literature in terms close to the original studies, such as direct malicious prompting, role framing, contextual enrichment, victim-specific rewriting, and stealth-oriented reformulation. Through axial coding, these mechanisms were consolidated into higher-order capability themes, including prompting strategy, external-context reliance, personalization, evasion, and pipeline integration. Disagreements in coding or stage assignment were resolved through discussion until consensus was reached. Thus, we derived an empirically grounded nine-stage capability escalation taxonomy for LLM-enabled phishing generation refined via attack paradigms, methodology and attack properties, as shown in Table II:

III-A1 Attack Paradigms

We identify three ways in which attackers operationalize LLMs in phishing generation tasks. a) Prompt-level misuse covers S1–S4 (Foc. = P), where attackers manipulate the LLMs through manually or automatically crafted instructions. b) Content optimization covers S5–S6 (Foc. = C), where human-crafted prompts guide LLMs to personalize, ground, or stealthily rewrite generated phishing content. c) Campaign scaling covers S7–S9 (Foc. = S), where attackers use LLMs to generate phishing content at scale across multiple payloads, channels, and targets.
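The stage-to-paradigm mapping stated above can be written down as a small lookup. This is a sketch for readers who want to tag studies programmatically; the function and variable names are illustrative and not part of the taxonomy itself.

```python
# Focus codes and stage ranges as stated in the text:
# P = prompt-level misuse (S1-S4), C = content optimization (S5-S6),
# S = campaign scaling (S7-S9).
PARADIGMS = {
    "P": ("Prompt-level misuse", range(1, 5)),
    "C": ("Content optimization", range(5, 7)),
    "S": ("Campaign scaling", range(7, 10)),
}


def paradigm_of(stage: str) -> str:
    """Return the focus code (P, C, or S) for a stage label such as 'S6'."""
    number = int(stage.lstrip("Ss"))
    for code, (_name, stages) in PARADIGMS.items():
        if number in stages:
            return code
    raise ValueError(f"unknown stage: {stage}")


print([(f"S{i}", paradigm_of(f"S{i}")) for i in range(1, 10)])
```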

III-A2 Attack Methodology

We identify Attack Tactics, Exploited Vulnerability, Attack Goal, and Target LLMs in existing works to understand how attackers instantiate phishing generation methods, which vulnerabilities of LLMs are exploited (e.g., instruction-following tendencies [78]), what phishing goals are pursued, and which LLMs are targeted. For instance, attackers set up a malicious instruction “write a phishing email about…” to frame the request as a writing task and steer the model toward drafting phishing content.

III-A3 Attack Properties

We characterize attack properties through four aspects: a) Personalization Level (Per.) indicates the degree of victim-specific tailoring in the generated attack content (○: low, ◐: medium, ●: high). b) Automation Level (Auto.) captures the extent to which Attack Tactics can be executed automatically after initial setup (○: largely manual prompting; ◐: semi-automated workflows with some human guidance; ●: highly automated generation or optimization pipelines). c) Attack Products (Prod.) refers to the generated phishing modality: Phishing Email, Vishing Script, or Quishing. d) Implementation Difficulty (Diff.) indicates the attacker-side effort and technical complexity required to operationalize the method, where 1 = direct prompt-based misuse with minimal setup; 2 = structured prompt engineering; 3 = tool-assisted or controlled generation; 4 = multimodal or agentic orchestration; and 5 = training- or optimization-intensive adaptation. In summary, a higher difficulty category corresponds to a more complex generation mechanism, tool use, or execution configuration.
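The four attack properties can be captured in a small typed record, which makes the coding scheme easier to check for consistency. The sketch below is one possible encoding: the class and field names are assumptions, and the single example row only restates the S1 values given in this section (the product label is chosen for illustration).

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Three-level scale used for Per. and Auto. (low, medium, high)."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2


# Difficulty scale as defined in the text.
DIFFICULTY = {
    1: "direct prompt-based misuse with minimal setup",
    2: "structured prompt engineering",
    3: "tool-assisted or controlled generation",
    4: "multimodal or agentic orchestration",
    5: "training- or optimization-intensive adaptation",
}

PRODUCTS = {"Phishing Email", "Vishing Script", "Quishing"}


@dataclass(frozen=True)
class AttackProperties:
    stage: str
    per: Level
    auto: Level
    prod: str
    diff: int

    def __post_init__(self):
        # Reject values outside the scales defined above.
        if self.diff not in DIFFICULTY:
            raise ValueError(f"Diff. must be 1-5, got {self.diff}")
        if self.prod not in PRODUCTS:
            raise ValueError(f"unknown product: {self.prod}")


# Illustrative row: S1 is reported with Per. = low, Auto. = low, Diff. = 1.
s1 = AttackProperties("S1", Level.LOW, Level.LOW, "Phishing Email", 1)
print(s1, "->", DIFFICULTY[s1.diff])
```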

III-B Existing Works Review

This section illustrates how prompting has evolved from direct instruction to multi-turn reasoning and automated evasion, revealing a gradual shift from explicit malicious intent to implicit steering and self-directed optimization.

III-B1 Prompt-Level Misuse

S1. Basic Instruction. Early attempts worked with direct, explicit commands (Per. = ○, Auto. = ○, Diff. = 1). In the early stage of LLM deployment, attackers could simply prompt a model to “generate a bank transfer phishing email” [57, 60], exploiting weak or immature safety guardrails. Compared with ML and DL models, LLMs are more capable in both understanding and generation, leading to the emergence of LLM-based tools designed or adapted for phishing, such as WormGPT / FraudGPT [79]. Such tools lowered the barrier to conducting phishing campaigns, making malicious content generation more accessible to low-skill attackers. These results highlight the vulnerability of under-aligned or self-hosted models to explicitly malicious instructions, without any need for obfuscation or conversational setup.

S2. Role-Framed Prompting. With platform-level refusals [80], attackers began masking intent through identity and role framing. Prompt templates asking the model to behave as a professional (for example, second person, “you are a cybersecurity expert” [55]) or claiming legitimate research purposes (for example, first person, “I am a cybersecurity researcher” [12]) induce cooperation (Per. = ○, Auto. = ○, Diff. = 1). These prompts exploit the models’ tendency to help in seemingly authorized contexts rather than explicitly overriding safety rules. Although instruction-hierarchy schemes [81] attempt to prioritize system-level constraints, role-framed prompts remain effective. The harmful objective is not presented as an explicit policy violation, but as a task embedded in a plausible workflow. As a result, malicious tasks can be reframed as benign assistance.

Insight 1: Self-declared legitimacy can be more effective than forcing the LLM into a malicious role. When the user claims to be a researcher or auditor, the model perceives phishing generation as a cooperative task completion, exposing a structural weakness in intent-based safety mechanisms.

S3. Multi-turn Task Decomposition. Prompting next evolved into subtask decomposition or multi-turn dialogues (Per. = ○, Auto. = ○). The fragmented phishing intent usually starts with benign content (e.g., realistic email reply scenarios) or exploratory or hypothetical statements, while subsequent prompts request sensitive information (e.g., payment details) [35, 36, 44]. This tactic builds the phishing context sequentially across turns (Diff. = 2) and keeps the malicious objective concealed within each individual prompt, which weakens single-round detection mechanisms and guides the model towards the final sensitive requests [82, 83].

Insight 2: Fragmenting phishing intent across neutral subtasks shifts detection from identifying explicit malicious text in a single prompt to reasoning over how intent accumulates across the dialogue, thereby weakening single-turn screening and enabling harmful intent to evade detection.
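From the defender's side, the point of Insight 2 can be illustrated with a toy screener that compares per-turn scoring against dialogue-level accumulation. This is a deliberately simplified sketch: the cue list, weights, threshold, and example turns are all hypothetical, and a deployed screener would use a learned intent model rather than substring matching.

```python
# Hypothetical cue weights; real systems would not rely on a keyword list.
CUE_WEIGHTS = {
    "urgent": 0.3,
    "verify your account": 0.4,
    "payment details": 0.5,
    "password": 0.5,
}


def turn_score(text: str) -> float:
    """Sum the weights of all cues that occur in a single turn."""
    lowered = text.lower()
    return sum(w for cue, w in CUE_WEIGHTS.items() if cue in lowered)


def single_turn_flags(turns, threshold=0.6):
    """Screen each turn in isolation."""
    return [turn_score(t) >= threshold for t in turns]


def dialogue_flag(turns, threshold=0.6):
    """Screen the accumulated score over the whole dialogue."""
    return sum(turn_score(t) for t in turns) >= threshold


logged_turns = [
    "Help me draft a reply to a customer about their invoice.",
    "Make the tone more urgent.",
    "Add a line asking them to confirm their payment details.",
]
print(single_turn_flags(logged_turns), dialogue_flag(logged_turns))
```

In this toy log no individual turn crosses the threshold, while the accumulated score does, which is the gap between single-turn screening and reasoning over the dialogue.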

S4. Scenario-driven Adaptation. Attackers can launch phishing campaigns by requesting LLMs to generate content adapted to a specific scenario. They can tailor prompts to particular target groups or organizations, such as police departments [3] and universities [33]. Attackers can also request phishing generation based on a given set of attributes, such as entities, URLs, and attachments [73]. These scenarios can be further refined with persuasion cues (e.g., Authority) and generation rules (e.g., V-Triad) that increase perceived credibility (Per. = ◐, Auto. = ○, Diff. = 2 [40, 45, 84]). One example combines GPT-4 with V-Triad rules [40] in the prompt “Create an email offering… for Harvard Students to Starbucks, with a link for them to access the QR code…”, structuring phishing content around credibility (e.g., adding a Starbucks logo), compatibility (e.g., aligning with the impersonated brand and target group), and customizability (e.g., inserting scenario-specific QR code elements). Building on this, attackers can move toward semi-automation by setting placeholders, such as “[principles here]”, and asking LLMs to select or adapt suitable social engineering strategies for the given scenario (Per. = ◐, Auto. = ◐, Diff. = 3 [71, 46]).

III-B2 Content Optimization

S5. Personalization for Credibility. Moving from group-level scenarios to victim-level adaptation, attackers can condition prompts on victim-specific context retrieved from the web to craft spear-phishing messages (Per. = ●, Auto. = ◐, Diff. = 2 [61, 72]). In this setting, agentic pipelines can first gather profile information and then use it to drive generation. However, if the collected profiles are public and outdated, such personalization may fail to capture the victim’s current context. To further increase personalization and realism, available signals from the physical environment (e.g., facial expressions, scene objects) and online activity (e.g., Instagram/LinkedIn posts) can be fused to adapt persuasive dialogues in live social engineering interactions [59]. When victim-specific or real-time signals are unavailable, LLMs can instead synthesize plausible victim profiles to fill in missing details and maintain a coherent attack narrative, thereby reducing the need for manually collected victim information (Per. = ●, Auto. = ●, Diff. = 3 [63, 69]). Results indicate that this synthetic personalization makes phishing content more deceptive and can degrade the performance of existing ML/DL-based detectors [63].

Insight 3: LLM-driven personalization shifts phishing from static template generation toward adaptive context construction. By combining retrieved profiles, inferred victim attributes, and contextual signals, attackers can generate persuasive narratives whose provenance becomes increasingly opaque, making phishing content harder for defenders to verify, attribute, or audit.

III-B3 Campaign Scaling

S6. Stealthy Rewriting. Rather than directly requesting phishing content, attackers can cast phishing generation as a textual transformation task. This includes directly rewriting phishing text (Auto. = ◐, Diff. = 1 [19]), applying synonym substitution and sentence restructuring (Auto. = ◐, Diff. = 1 [37]), or using multi-turn rephrasing with homoglyph character substitution and polymorphic variations, all of which aim to evade rule-based defenses (Auto. = ◐, Diff. = 2 [70]). These transformations obscure explicit phishing markers while preserving the underlying persuasive structure, making rewritten phishing content harder to distinguish from benign content.
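One reason homoglyph substitution defeats rule-based defenses is that keyword rules compare raw code points. A common mitigation is to normalize text before matching. The sketch below shows the idea under stated assumptions: the `CONFUSABLES` table is a tiny hand-picked subset for illustration, not a complete confusables list, and none of the surveyed works is claimed to use this exact procedure.

```python
import unicodedata

# Tiny illustrative subset of Cyrillic letters that resemble Latin ones.
# A real deployment would use a full confusables table.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0456": "i",  # Cyrillic small Byelorussian-Ukrainian i
}


def normalize(text: str) -> str:
    """Fold compatibility forms (NFKC), lowercase, then map lookalikes."""
    folded = unicodedata.normalize("NFKC", text).lower()
    return "".join(CONFUSABLES.get(ch, ch) for ch in folded)


def matches_rule(text: str, keyword: str) -> bool:
    """Apply a keyword rule to the normalized text."""
    return keyword in normalize(text)


spoofed = "p\u0430yp\u0430l"   # Latin p/y/l mixed with Cyrillic a
fullwidth = "\uff50aypal"      # fullwidth p, folded by NFKC
print(matches_rule(spoofed, "paypal"), matches_rule(fullwidth, "paypal"))
```

Normalization of this kind only addresses character-level obfuscation; synonym substitution and sentence restructuring leave no such artifacts to fold.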

S7. Cross-channel Expansion. Attackers increasingly extend phishing beyond text into audio and visual channels by pairing LLMs with generative media tools. They can leverage LLMs to transcribe speech and generate real-time spoken responses that match a chosen persona, tone, and conversational style (Per. = ●, Auto. = ●, Diff. = 4 [25, 26]). In the reported results, about half of the victims disclosed sensitive information, and about one-third did so even when a clear warning was displayed. Vishing attacks can also be augmented with benign conversational noise, such as casual small talk, to make malicious intent less salient (Per. = ○ [62]). Combined with LLM-enhanced personalized scripts and real-time dialogue, these attacks make malicious intent harder to recognize and increase disclosure success [25].

To make attacks appear less suspicious, malicious links can be concealed behind QR codes rather than shown explicitly, allowing the same underlying phishing content to bypass email filters (e.g., Gmail) that would otherwise flag a URL-based version (Per. = ○, Auto. = ◐, Diff. = 2 [24]). Compared with URL-based phishing, Quishing attacks emphasize visual elements and browser-style interfaces shown after the victim lands on the webpage (Diff. = 3 [58]). This moves phishing detection from textual content to visual elements, exploiting users’ limited awareness of the misuse risks associated with interactive visual elements [85].

Insight 4: Cross-channel attacks such as Quishing and Vishing support continuous and adaptive interaction. LLMs can update the content in real time, which allows the attacker to adjust tactics based on the victim’s reactions. Email-based phishing, in contrast, is static and provides no opportunity to refine persuasion once the message is delivered.

S8. Model-driven Automation. Instead of manually prompting, attackers can use LLMs to generate candidate prompts that reframe phishing content through summarization-style tasks, such as “…proposing an enhancement to a cloud management dashboard…”. The model can then evaluate the generated outputs and refine the synthesized phishing prompts over repeated cycles [56]. A simple instruction, such as “summarize the given content”, can bootstrap this iterative process, producing multiple safety-bypassing prompt variants and enabling phishing campaigns to scale (Auto. = ●, Diff. = 2 [86]). This shifts the attack from manual prompt engineering to model-driven prompt search, where the LLM helps generate, test, and refine variants with limited human intervention.

S9. Data-guided Model Adaptation. Moving beyond prompt-level automation, attackers can utilize phishing-oriented training datasets, such as (benign, phishing) topic-keyword pairs (Per. = ○ [32]), cross-language phishing corpora (Per. = ○ [42]), and masked sensitive information datasets (Per. = ◐ [43]), to make generation more targeted, diverse, and controllable. This adaptation can be further reinforced through adversarial or iterative optimization, where generated samples are repeatedly evaluated and refined via game theory [31], reflection-guided beam search [41], or detector feedback [55] to improve realism, persuasiveness, and evasion. These methods require task-specific datasets, implementation effort, and attacker knowledge of model training or system configuration (Diff. = 5). Thus, this stage introduces higher complexity than prompt-based attacks.

Insight 5: Existing fine-tuning and adversarial workflows for LLM-generated phishing rely heavily on early model architectures such as GPT-2. Fine-tuning tends to reinforce a specific style of attack and reproduce its user impact, while adversarial optimization focuses on creating datasets that improve evasion against detection systems.

III-C Threats In-the-Wild

LLM capabilities have been repackaged into commercialized phishing tools and services [87]. One representative example is WormGPT [79], promoted in 2023 as a GPT-J-based phishing tool and estimated to have generated over $28,000 in revenue within roughly two months [88]. These services rarely depend on fundamentally new model development; instead, they often use jailbreak-as-a-service to bypass safeguards at the prompt level [89, 90], lowering the barrier to malicious adoption (e.g., KawaiiGPT is configurable in under five minutes [91, 92]). Further, LLM misuse has expanded beyond phishing email generation: services now advertise code obfuscation and cookie/log replication [93], and integrate LLM APIs into malware and broader attack chains, enabling more automated and adaptive attacks [94, 89].

Insight 6: LLM capability has been repackaged as commercial phishing tools and services, expanding from phishing text to malicious code and other attack components. Recently, these phishing services have moved beyond on-request generation toward more automated and agentic attack pipelines with dynamic intrusion activities [94].

IV RQ2. Characterizing LLM-Generated Phishing

IV-A Systematization Methodology

To understand how LLM-generated phishing differs from human-written phishing, we organize existing studies along three analytical paradigms (Table III):

IV-A1 Analysis Paradigms

refer to the level at which phishing content and its effects are examined. a) Text Traits focus on textual properties and social engineering patterns; b) Human Factors examine user-related attributes such as demographics and psychological traits, to understand how different users perceive and respond to synthesized phishing; and c) Model Traits consider the capability and efficiency, highlighting how scaling and automation affect the phishing threat landscape.

IV-A2 Analytical Perspective

captures how prior work designs and supports the analysis. a) Comparison Dimension specifies what is being compared; b) Research Focus reflects the goal of the study, such as measuring persuasion effectiveness, and c) Evidence Basis indicates that the conclusions are supported via behavioral experiments (Beh.), perception studies (Perc.), simulations (Sim.), qualitative analysis (Qual.), and system-level observations (Sys.).

IV-A3 User-Side Effects

a) Impact Type captures the forms through which LLM-generated phishing influences users, including increased exposure to phishing content (Exposure), heightened perceived legitimacy (Perceived Legitimacy), strengthened behavioral susceptibility (Behavioral Susc.), impaired source attribution (Attribution Diff.), and user-dependent differences in susceptibility (Susc. Moderators). b) Reported Effects demonstrate the representative impact of LLM-generated phishing attacks on users.

IV-B Existing Works Review

IV-B1 Textual Traits

Across several empirical studies focused on textual traits, a consistent pattern is that, compared with human-written phishing, LLM-generated phishing tends to be more fluent, coherent, and conversational. This contrasts with template-based phishing emails, which rely on rigid templates and transactional wording and contain obvious linguistic mistakes. Prior analyses report increased use of verbs and pronouns, together with fewer numeric tokens, suggesting that LLM-generated phishing emphasizes actions, conversation, and interpersonal engagement rather than detailed transactional information [57]. LLM-generated phishing can also evade phishing detection by reducing overt grammatical and spelling errors, while preserving subtle imperfections that maintain authenticity in pretext-heavy scenarios [40, 64, 66]. Additionally, LLMs can mimic benign content, with features such as part-of-speech usage, lexical diversity, and sentence-length patterns that are closer to legitimate business writing [19]. These textual similarities make LLM-generated phishing harder to distinguish using surface-level textual features [64, 57].

Beyond surface-level traits, recent work highlights how social engineering tactics are expressed in LLM-generated phishing content. Researchers found that LLM-generated phishing tends to rely on a broader mix of persuasion principles, such as Scarcity, Authority, and Consistency, whereas human-written phishing leans more heavily on Authority [55]. User detection studies further show that scarcity-based messages are recognized more easily, while adding authority significantly reduces detection accuracy [3]. The combined use of multiple persuasion principles makes phishing intent harder for users to reliably identify.

Insight 7: LLM-generated phishing reduces recognizable lexical mistakes and adopts fluent, action-centered phrasing that resembles legitimate communication. This shift challenges filters and detectors that rely on conspicuous errors or formatting artifacts as cues for identification.

IV-B2 Human Factors

Studies focusing on the human side suggest that LLM-generated phishing content affects users in various ways. Users with lower technical literacy or limited security training are often more susceptible to deception. However, even technically skilled users remain vulnerable when attacks involve convincing scenarios that impersonate authority figures [47]. Moreover, highly credible LLM-enhanced persona contexts tend to be perceived as trustworthy across users of different ages, increasing their tendency to trust and comply with malicious requests [48]. Overconfidence, susceptibility to persuasion, and curiosity can lead users to overtrust their own judgment, reduce vigilance toward phishing cues, and view deceptive messages as legitimate [39, 38, 65]. The scalable synthetic phishing content weakens users’ ability to judge legitimacy and respond cautiously.

Insight 8: The contribution of LLM-generated phishing to overall user risk varies across user profiles. These attacks can adapt to low-literacy users through direct manipulation, to technical experts through authority-based impersonation, and to highly trusting or curious users through multi-trigger tactics.

Gap 1: Existing work primarily examines static user attributes and self-reported measures. Behavioral patterns such as reflexive clicking, ignoring security warnings, and overtrust in AI remain largely unmodeled. These understudied factors limit our understanding of real-world vulnerability under adaptive LLM-generated phishing.

IV-B3 Model Traits

Beyond text quality and psychological cues, LLM-generated phishing also changes the scale and delivery of attacks. The main risk is not only more convincing content, but also faster, cheaper, and more diverse generation. Prior work shows that personalized phishing can be generated 96% faster than human-written attempts, while user engagement can more than double [49]. Prompt templates, agent-based pipelines, and rapid iteration further enable attackers to scale campaigns and test phishing variants more efficiently [49].

IV-C Phishing patterns across attack payloads

In this section, we discuss differences in the phishing patterns used in LLM-generated phishing content across attack modalities. Because available datasets are limited and our analysis relies on text-based features, we exclude the comparison of Quishing features from this section. However, we provide a brief discussion of detector performance on synthesized QR codes (URLs transfer) in Appendix B. We apply persuasion principles [84], which are discussed in both academic and industrial settings [95, 96, 97, 98], as the analytical method. We adopt a general persuasion principle annotation method [99] to avoid biased annotation on specific datasets or attack vectors. Our datasets consist of recent publicly available resources; details are available in Appendix A-C.

Phishing attack payloads differ in how persuasion strategies unfold. Multiple attack strategies may be used in a single email (Fig. 9), whereas Vishing scripts distribute persuasive strategies across dialogue rounds. For example, multi-turn Vishing scripts introduce Authority and Liking early to quickly establish trust and attract attention (Fig. 2). From the third round onward, Reciprocity becomes more visible, with phrases (e.g., “transfer to supervisor”, Table A11) that suggest help or solutions to make users more likely to engage in induced actions. These differences indicate the limited transferability of anti-phishing methods across distinct phishing modalities. These modality- and stage-dependent differences further complicate phishing detection, as defenses must go beyond static textual cues and account for how attacks evolve across contexts, interaction structures, and generation stages.

Insight 9: Persuasive cues vary across phishing modalities. Email payloads often compress multiple strategies into a single message, whereas Vishing scripts unfold them progressively across dialogue rounds. This highlights the need for phishing detectors to model modality-specific attack carriers and stage-dependent persuasion patterns.
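The stage-dependent pattern described in Insight 9 can be made measurable by tallying annotated persuasion principles per dialogue round. The following is a minimal sketch: the annotation tuples are hypothetical toy data shaped after the pattern described above, not the annotated datasets used in the analysis.

```python
from collections import Counter, defaultdict

# Hypothetical (round, principle) annotations for one Vishing script.
annotated_turns = [
    (1, "Authority"),
    (1, "Liking"),
    (2, "Authority"),
    (3, "Reciprocity"),
    (4, "Reciprocity"),
]


def principles_by_round(annotations):
    """Count how often each principle is annotated in each round."""
    per_round = defaultdict(Counter)
    for round_no, principle in annotations:
        per_round[round_no][principle] += 1
    return dict(per_round)


def first_round(annotations, principle):
    """Return the earliest round in which a principle appears, or None."""
    rounds = [r for r, p in annotations if p == principle]
    return min(rounds) if rounds else None


print(principles_by_round(annotated_turns))
print(first_round(annotated_turns, "Reciprocity"))
```

For a single email, the same tally collapses to one round, which is the contrast between compressed and progressive persuasion drawn above.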

V RQ3. Countermeasures Across the LLM-Generated Phishing

V-A Systematization Methodology

To understand how existing defenses respond to different LLM-enabled phishing, we map defense mechanisms to the nine exploit stages. Existing defenses are often evaluated as general phishing detectors, without considering how phishing content is generated. However, our nine-stage taxonomy shows that LLM-generated phishing progressively shifts from explicit malicious requests to contextualized, rewritten, multi-turn, and model-specific generation. This creates a fundamental mismatch between static defense assumptions and adaptive attack generation, which limits the effectiveness of existing phishing defenses under LLM-driven attack settings.

Therefore, in RQ3, we analyze defenses not only by their technical design, but also by the attack stages they can cover and the conditions under which they fail. Table IV summarizes what risks each method can reveal and where major defense gaps emerge. To systematically organize existing defenses, we characterize them along five complementary dimensions.

V-A1 Defense Paradigms

We summarize defenses along four high-level paradigms. a) Textual Characteristics Screening identifies phishing through surface-level textual features, such as lexical patterns and stylistic cues. b) Semantic and Social Engineering Tactics Modeling captures content-level signals, including intent, persuasive strategy, and social-engineering tactics. c) Rule-Compliance Screening examines whether prompts, intermediate requests, or generated outputs violate safety policies. d) Human-Centric Defense includes user training and awareness campaigns.

V-A2 Defense Scope

We use three perspectives to characterize defense coverage. a) Goal specifies whether a defense targets phishing content (Phish), source attribution (Src), malicious intent (Intent), or user susceptibility (Behav). b) Attack Points (Pts.) identifies where a defense operates within the phishing campaign’s pipeline: input-level (I), delivery-level (D), output-level (O), or user-side (U). c) Stage Coverage (Stg.) maps each defense to our nine-stage taxonomy (S1–S9).
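Coding each defense by goal, attack point, and stage coverage makes it straightforward to compute which stages no defense addresses. The sketch below illustrates that computation; the `Defense` record and the two example entries are hypothetical and are not rows from Table IV.

```python
from dataclasses import dataclass

ALL_STAGES = [f"S{i}" for i in range(1, 10)]


@dataclass(frozen=True)
class Defense:
    name: str
    goal: str          # Phish, Src, Intent, or Behav
    point: str         # I, D, O, or U
    stages: frozenset  # subset of S1..S9


def uncovered_stages(defenses):
    """Return the stages that no listed defense covers, in order."""
    covered = set()
    for d in defenses:
        covered |= d.stages
    return [s for s in ALL_STAGES if s not in covered]


# Hypothetical entries for illustration only.
defenses = [
    Defense("toy prompt screener", "Intent", "I",
            frozenset(f"S{i}" for i in range(1, 7))),
    Defense("toy stylometric detector", "Src", "O",
            frozenset({"S1", "S2"})),
]
print(uncovered_stages(defenses))
```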

V-A3 Defense Methodology

To systematically characterize how existing defenses are designed, we code each method along three dimensions: a) Type describes the methodological paradigm of a defense system, including machine learning (ML), deep learning (DL), large language model-based methods (LLMs), federated learning (FL), and cognitive or human-behavioral models (Cog.). b) Defense Model specifies the concrete model, algorithm, or system used to implement the defense, and c) Signals (Sig.) captures the evidence or input cues used in decision-making, including stylometric/textual signals (ST, e.g., lexical diversity); Trigger Tags (TT, e.g., predefined pattern-trigger pairs); knowledge-base signals (KB, e.g., known brands); social engineering tactics (SE, e.g., impersonation); Policy Signals (PC, e.g., phishing-generation requests); and Behavioral Signals (BH, e.g., click, input).

V-A4 Input Requirements

We specify the input evidence required by each defense method. Required inputs include email headers (H), message content (C), URLs (U), and other modalities (O), such as audio recordings, QR codes, screenshots, or attachments. Here, message content includes both written phishing emails and Vishing scripts.

V-A5 Operational Properties

We characterize the operational properties of LLM-generated phishing defenses along six dimensions. a) Training data dependency (Data) and b) Prompt dependency (Prompt) capture the method’s reliance on training datasets or prompts (○: low, ◐: medium, ●: high). c) Generator coverage (XGen.) indicates whether the method generalizes across phishing outputs from different LLM generators (○: No, ●: Yes). d) Artifacts Availability (Art.) reflects whether datasets and AI-generated phishing artifacts are publicly available, where ● = fully public, ◐ = partially public, and ○ = not public. e) External Dependency (Ext.) captures reliance on external runtime configurations or infrastructure, where ● = strong dependency, ◐ = moderate setup requirements, and ○ = largely standalone. f) Reproducibility (Repro.) is assessed based on the availability of code, data, implementation details, and reproduction steps, where more filled squares indicate easier reproduction.

Overall, these dimensions suggest a fundamental asymmetry: while LLM-based phishing generation evolves across stages with increasing adaptability and contextual sophistication, existing defenses remain largely static, relying on fixed features, rules, or prompts. This mismatch underlies many of the detection failures observed in practice.

V-B Existing Works Review

V-B1 Content-Tailored Detection

a) Textual Characteristics Screening distinguishes phishing sources by capturing lexical and stylistic traces in the generated content. For instance, LLM-generated phishing often exhibits lower punctuation frequency [46] and fewer spelling and grammar errors [51], reflecting LLMs’ fluent and standardized writing patterns. These lexical and stylistic cues can be encoded through various feature representation methods, including TF-IDF [50, 67], Word2Vec [54], Universal Data Analysis (UDAT) [57], quantum-inspired feature encoding [74], and LLM-reasoned phishing indicators [52]. However, these defenses assume that phishing content preserves distinguishable surface-level artifacts. This assumption becomes fragile under LLM-generated phishing, where attackers can suppress spelling errors, normalize style, and rewrite suspicious wording while preserving the malicious intent. As a result, stylometric detectors are vulnerable not only because their features are incomplete, but because these features can be directly manipulated through prompting and rewriting.
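To make the notion of surface-level stylometric signals concrete, the sketch below computes three simple features of the kind discussed above: punctuation frequency, the share of numeric tokens, and lexical diversity (type-token ratio). The feature definitions and names are assumptions for illustration and do not reproduce the feature sets of the cited detectors.

```python
import re
import string


def stylometric_features(text: str) -> dict:
    """Compute simple surface-level features of a message."""
    tokens = re.findall(r"\w+", text.lower())
    n_chars = len(text)
    n_punct = sum(ch in string.punctuation for ch in text)
    n_numeric = sum(tok.isdigit() for tok in tokens)
    return {
        # Punctuation characters per character of text.
        "punct_freq": n_punct / n_chars if n_chars else 0.0,
        # Share of tokens that are purely numeric.
        "numeric_ratio": n_numeric / len(tokens) if tokens else 0.0,
        # Distinct tokens divided by total tokens.
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
    }


features = stylometric_features("Pay 100 now, pay now!")
print(features)
```

Because every one of these values changes when a message is rewritten, features of this kind are exactly what prompting and rewriting can manipulate.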

b) Semantic & Social Engineering Tactics Modeling identifies phishing by examining how deceptive intent is expressed through wording, semantic cues, and persuasion tactics. Existing methods encode semantic representations with T5-encoder (Data = ◐, Prompt = ○, [71]) or LLMs (Data = ◐, Prompt = ○, [70, 52]), and use LLM-assisted rules to infer malicious intent (Data = ○, Prompt = ●, [40]). These methods improve over surface-level detectors by reasoning about persuasion and intent. However, their reliability depends on whether the predefined semantic rules or prompts cover the relevant manipulation strategies. When LLM-generated phishing embeds intent in benign-looking contexts or uses softer requests, semantic cues alone may be insufficient for reliable discrimination.

c) Prompt Intent Screening detects malicious intent at the input stage of the generation pipeline (Pts. = I). Roy et al. [56] demonstrate that phishing intention can be filtered by treating user prompts as indicators of malicious intent. Pang et al. [15] further use trigger-tag pairing to link prompt-side phishing triggers with output-side hidden tags, enabling defenders to identify malicious prompts and block generated content before delivery. These defenses enable intervention from S1 to S6, while heavily relying on the training datasets (Data = ●) and implementation configurations (Ext. = ●). Additionally, these defenses remain vulnerable to benign-looking or obfuscated prompts that disguise phishing goals, such as “Draft a notification about an individual’s eligibility for a prestigious credit card.” (S6, S8).

Insight 10: Semantic and LLM-based analyzers shift the detection problem from feature engineering to prompt and rule design. They can capture rhetorical and intent-level cues, but their effectiveness remains bounded by the coverage of the specified reasoning patterns.

Insight 11: Prompt-level screening can intercept early-stage misuse (S1–S3), but becomes less reliable when attackers distribute intent across turns, disguise it through benign framing, or move the malicious objective to later rewriting and automation stages.

d) Rule-Compliance Screening checks whether content or interactions conform to predefined rules and organizational procedures. Existing methods either rely on metadata and interaction patterns [17, 69], such as sender identities, URL domains, and conformance to organizational procedures; or evaluate LLM-generated content against governance, risk management, and compliance records [53]. However, this paradigm remains limited when predefined rules insufficiently represent malicious intent hidden in semantic or contextual information.

Insight 12: LLM-based analyzers provide interpretable reasoning beyond black-box classifiers, but their explanations may still reflect biased or incomplete reasoning patterns, such as overemphasizing urgency while underweighting contextual legitimacy.

Gap 2: Existing LLM-based analyzers lack mechanisms to verify whether their explanations faithfully reflect their decision logic. This creates a reliability risk: a detector may provide plausible reasoning while relying on spurious correlations, leading to unpredictable failures under adversarial or distribution-shifted inputs.

V-B2 Human-Centric Defense

Human-centric defense targets user awareness, susceptibility modeling, and detection capabilities, although existing work in this direction remains limited. Malloy et al. [68] model individual susceptibility (e.g., decisions, confidence, and actions) to predict how users react to LLM-generated phishing over time. These predictions then guide personalized training, helping users improve resilience against phishing attempts. Human-centric defenses calibrate to individual vulnerabilities (Data = ●, Repro. = $\blacksquare\square\square\square\square$), while real-time anti-phishing intervention remains challenging.

V-C Benchmarking Phishing Detectors

To understand how current detectors perform under different LLM-generated phishing stages, we conduct a comparative benchmark covering both academic and industrial detectors.

V-C1 Benchmarking Setup

Detectors, Datasets, and Metrics. For reproducibility, we select detectors with documented datasets, trained models, and deployment settings. Our benchmark includes five academic detectors: XGBoost-phishing [18], T5-phishing [71], PimRef [69], Scamllm [56], and Securenet [17]; and five industrial detectors: Phishing email agent [100], Rspamd [101], Spamscanner [102], Spamassassin [103], and PhishingV3 [104]. We evaluate detection performance using recall, precision, true negative rate (TNR), and the Matthews correlation coefficient (MCC) [105]. Recall measures phishing detection capability, TNR measures benign classification capability, and MCC reflects overall classification capability under class imbalance. To support more robust benchmarking, we collect and release recent publicly available datasets of human-written and LLM-generated phishing [29]. Datasets explicitly reported as training data for the selected detectors are excluded from the benchmark.

Annotation Let $\mathcal{D}=\{X_{1},X_{2},\ldots,X_{n}\}$ denote the benchmark datasets, where each $X_{i}$ represents a sub-dataset and $X_{i}=\{x_{i,1},x_{i,2},\ldots,x_{i,m_{i}}\}$ . Each sample $x_{i,j}\in X_{i}$ is associated with a ground-truth label $y(x_{i,j})\in\{B,P\}$ and a source $s(x_{i,j})\in\{\mathrm{HW},\mathrm{LLM}\}$ , with $B$ and $P$ denoting benign and phishing content, respectively. Let $\mathcal{F}=\{f_{1},f_{2},\ldots,f_{10}\}$ denote the detector set. For any detector $f\in\mathcal{F}$ , we define $f:\!x_{i,j}\mathord{\rightarrow}\{B,P\}$ ; the prediction of $x_{i,j}$ is denoted as $f(x_{i,j})$ . We define the detection outcome $o(x_{i,j})$ according to $(y(x_{i,j}),f(x_{i,j}))$ : $\mathrm{TN}$ if $(B,B)$ , $\mathrm{FP}$ if $(B,P)$ , $\mathrm{TP}$ if $(P,P)$ , and $\mathrm{FN}$ if $(P,B)$ . For visualization, we use $\mathcal{D}_{s-y-o}=\{x_{i,j}\in X_{i}:s(x_{i,j})=s,y(x_{i,j})=y,o(x_{i,j})=o\}$ to distinguish samples by their source, label, and detection outcome. For a specific detector $f\in\mathcal{F}$ , $\mathcal{D}_{\mathrm{LLM}-P-\mathrm{FN}}=\{x_{i,j}\in X_{i}:s(x_{i,j})=\mathrm{LLM},y(x_{i,j})=P,f(x_{i,j})=B\}$ denotes LLM-generated phishing samples misclassified as benign.
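The notation above can be made concrete with a small sketch. The following Python snippet is illustrative only: the sample records, the field names (`text`, `label`, `source`), and the keyword-based `toy_detector` are hypothetical and are not part of the benchmark. It maps each pair $(y(x), f(x))$ to an outcome, selects a subset such as $\mathcal{D}_{\mathrm{LLM}-P-\mathrm{FN}}$, and computes recall, precision, TNR, and MCC from the outcome counts.

```python
from math import sqrt

# Outcome lookup for (ground-truth label, predicted label), both in {"B", "P"}.
OUTCOMES = {("B", "B"): "TN", ("B", "P"): "FP",
            ("P", "P"): "TP", ("P", "B"): "FN"}

def outcome(y, pred):
    """Return the detection outcome o(x) for one sample."""
    return OUTCOMES[(y, pred)]

def subset(samples, detector, source, label, out):
    """Return D_{s-y-o}: samples with the given source, label, and outcome."""
    return [x for x in samples
            if x["source"] == source and x["label"] == label
            and outcome(x["label"], detector(x)) == out]

def metrics(samples, detector):
    """Compute recall, precision, TNR, and MCC for one detector."""
    c = {"TN": 0, "FP": 0, "TP": 0, "FN": 0}
    for x in samples:
        c[outcome(x["label"], detector(x))] += 1
    tp, tn, fp, fn = c["TP"], c["TN"], c["FP"], c["FN"]

    def ratio(a, b):
        # Convention chosen for this sketch: an undefined ratio is reported as 0.
        return a / b if b else 0.0

    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"recall": ratio(tp, tp + fn),
            "precision": ratio(tp, tp + fp),
            "tnr": ratio(tn, tn + fp),
            "mcc": ratio(tp * tn - fp * fn, denom)}

# Hypothetical toy data and detector, for illustration only.
samples = [
    {"text": "verify your account now", "label": "P", "source": "HW"},
    {"text": "we noticed a security concern", "label": "P", "source": "LLM"},
    {"text": "meeting moved to 3pm", "label": "B", "source": "HW"},
    {"text": "please verify the attached agenda", "label": "B", "source": "LLM"},
]

def toy_detector(x):
    """Flag a message as phishing if it contains the word 'verify'."""
    return "P" if "verify" in x["text"] else "B"

llm_p_fn = subset(samples, toy_detector, "LLM", "P", "FN")
scores = metrics(samples, toy_detector)
```

In this toy setting, the softly worded LLM sample lands in $\mathcal{D}_{\mathrm{LLM}-P-\mathrm{FN}}$ because it avoids the keyword the detector relies on.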

V-C2 Overall Benchmarking Results

These empirical results are consistent with the limitations identified in our defense analysis. Specifically, detectors that rely on surface-level features or fixed semantic patterns struggle when LLM-generated phishing suppresses explicit cues, softens action requests, and redistributes persuasion signals across contexts and stages.

Table V shows that LLM-generated phishing content degrades the performance of detectors from both academia and industry. On human-written (HW) phishing, Securenet achieves the best performance, with a Matthews correlation coefficient (MCC) of 0.7767, while Scamllm also performs strongly, with an MCC of 0.6304. Both detectors attempt to detect phishing by inferring the underlying malicious intent. However, their performance drops substantially on LLM-generated phishing datasets. The performance degradation suggests that, although the underlying malicious intent remains, LLM-generated phishing content potentially modifies the features that detectors rely on to infer phishing intent.

To understand the degradation, we first examine the distributional relationship between $\mathcal{D}_{\mathrm{HW}-P}$ and $\mathcal{D}_{\mathrm{LLM}-P}$ . Fig. 3 shows these differences. Here $s(x)$ denotes the surrogate score defined in Appendix A-D, not the sample source. $\mathcal{D}_{\mathrm{HW}-P}$ is more concentrated in $s(x)\in[0.24,0.91]$ , where samples have a higher likelihood of being classified as phishing. In contrast, $\mathcal{D}_{\mathrm{LLM}-P}$ is distributed roughly within $s(x)\in[0.167,0.58]$ , where the surrogate phishing prediction is less certain. This difference is associated with roughly 16% more LLM-generated phishing samples evading detection, i.e., falling into $\mathcal{D}_{\mathrm{LLM}-P-\mathrm{FN}}$ (Table A9).

We further analyze the phishing patterns shown within $\mathcal{D}_{\mathrm{LLM}-P-\mathrm{FN}}$ . In Fig. 4 left, $\mathcal{D}_{\mathrm{LLM}-P-\mathrm{FN}}$ more frequently uses phishing patterns that are underrepresented in $\mathcal{D}_{\mathrm{HW}-P-\mathrm{FN}}$ , such as (Authority, Reciprocity), (Reciprocity, Liking), and (Authority, Social Proof). These more frequent pattern combinations may weaken the contextual evidence used by detectors to infer malicious intent. However, they are not sufficient for identifying LLM-generated phishing, as they also appear in $\mathcal{D}_{\mathrm{LLM}-P-\mathrm{TP}}$ . Samples with similar pattern combinations can still be either detected or missed.

We therefore inspect how the above patterns are expressed at the word and phrase level. Compared with $\mathcal{D}_{\mathrm{LLM}-P-\mathrm{TP}}$ , $\mathcal{D}_{\mathrm{LLM}-P-\mathrm{FN}}$ uses softer and less urgent action requests, with wording that is more natural and conversational (e.g., “security concern”, Table A12). This makes them closer to benign content than to the more explicit and urgency-driven $\mathcal{D}_{\mathrm{LLM}-P-\mathrm{TP}}$ . As a result, malicious intent becomes harder to infer from sentiment semantics and persuasion strategies alone, since the phishing request is expressed in a more benign-looking context. This suggests that the performance degradation is not merely caused by lexical variation, but also by how LLMs soften action requests and reorganize persuasive cues, making phishing intent less separable from benign communication.

S3 + Red Teaming. S3 focuses on the gradually unfolding inducement process within multiple rounds of dialogue (Fig. 5). In this type of attack, the final output may manifest as phishing text, sensitive information extraction, or other risk behaviors. We use the datasets described in Appendix A-C to evaluate vulnerabilities at this stage using the red-teaming tools PyRIT [107] and LLM-Guard [108]. Results show that LLM-generated multi-turn phishing is more likely to continuously build inducement context throughout multiple rounds, and is more difficult to capture with single-turn risk features (Table A8, e.g., LLM-Guard recall = 61.43). This suggests that current red-teaming tools remain limited in detecting cumulative phishing intent and may not transfer well from single-turn safety evaluation to multi-turn phishing scenarios.

Fig. 5 further shows that both HW and LLM multi-turn phishing rely heavily on Authority and Liking, while Scarcity and Commitment remain relatively weak. This indicates that multi-turn phishing often avoids overt urgency and instead builds a credible and friendly interaction context.

However, HW and LLM samples differ in how persuasion evolves across rounds. In HW dialogues, Reciprocity tends to decrease as the conversation progresses, suggesting that the interaction moves from offering benefits toward more direct action requests. In contrast, LLM-generated dialogues maintain or strengthen Reciprocity and Social Proof in later rounds, repeatedly emphasizing user benefits, normalized participation, and a safe interaction atmosphere.

This difference suggests that LLM-generated multi-turn phishing is less pushy but more persistent: it advances the attack by continuously reinforcing trust and perceived legitimacy rather than relying on urgent or coercive cues. Therefore, detecting multi-turn phishing requires modeling how persuasion accumulates over the dialogue, rather than classifying each turn independently. This further reinforces that multi-turn phishing evasion is driven not by the absence of persuasion, but by how persuasion is distributed and accumulated over time.

Academic Detectors vs Industrial Detectors. Table V shows that, in our text-only benchmark, industrial detectors generally achieve lower MCC than the strongest academic detectors. To understand why, we selected the best detector from each group and analyzed the persuasion strategies in the LLM-generated phishing emails they successfully detected, as shown in Fig. 6. On the same LLM-P dataset, both detectors respond to similar phishing features and cover multiple persuasion strategies, but the industry detector identifies far fewer samples. This suggests that the key difference lies less in the features they use than in the strength of the risk signal required for classification. Further analysis of trigger words and phrases shows that the industry detector mainly detects emails with explicit action requests, such as clicking links, opening pages, or submitting information. In other words, the industry detector works well when the malicious action is explicit, but is less sensitive to LLM-generated phishing that hides intent through context and persuasion. This does not imply that industrial detectors are weaker in deployment. Rather, it suggests that text-only LLM-generated phishing exposes a specific blind spot: implicit persuasion and softened malicious intent are harder to detect when auxiliary signals such as headers, URLs, attachments, and sender reputation are unavailable.

However, this result should be interpreted in the real-world context of industry detectors, which typically use signals beyond the email body, including headers, URLs, attachments, and sender reputation. In Table A7, adding header information significantly improves the industry detector’s performance. This suggests that industry detectors are stronger with multi-source signals, but remain less sensitive in text-only settings to hidden persuasion strategies and implicit malicious intent in LLM-generated phishing emails.

V-C3 Stage-aligned Benchmarking Results

The stage-level MCC results in Table V reveal clear performance differences across stages, suggesting that different generation strategies introduce distinct evasion patterns for detectors.

Combinatorial Persuasion. LLM-generated phishing emails can evade detection by recombining persuasion principles in less common ways. Samples in $\mathcal{D}_{\mathrm{HW}-P-\mathrm{FN}}$ tend to amplify familiar principles, such as (Authority, Authority) (Fig. 7, S1). In contrast, samples in $\mathcal{D}_{\mathrm{LLM}-P-\mathrm{FN}}$ exhibit a broader set of cross-principle combinations, including (Liking, Reciprocity), (Liking, Social Proof), and (Reciprocity, Social Proof). These LLM-induced discrepancies make phishing less typical and more likely to appear as false negatives.

Phishing Rephrasing. Detectors exhibit different robustness against phishing emails generated by different rewriting strategies. S6-MPG applies predefined rewriting rules in a stepwise manner [70], which tends to preserve the original phishing patterns, leading to a smaller distributional difference and effective detection performance (Fig. 8, right; Securenet MCC = 0.8910). In contrast, S6-fuzzer prompts LLMs to rewrite phishing content (e.g., “generate three variants of given emails”), allowing greater variation in wording and persuasive expression. The constraints imposed by S6-fuzzer are mostly entity-level, such as requiring generated properties to match the corresponding addresses. This produces a larger distributional gap (Fig. 8, left), under which Securenet’s MCC score drops to 0.4557. These results suggest that the risk of bypass through rephrasing is closely tied to the degree of constraint imposed on the LLMs. Less constrained rewriting is more likely to change the phishing cues that detectors rely on.

Diverse Generators. LLM-generated phishing attacks do not form a unified distribution. Instead, each generator introduces distinct bypass behaviors. DeepSeek shows the most significant variation (Fig. 7) and leads to the worst detection performance, with Securenet achieving an MCC of 0.1962 across LLM-generated samples. Samples in $\mathcal{D}_{\mathrm{Deepseek}-P-\mathrm{FN}}$ combine a broader set of persuasion principles that are less common in $\mathcal{D}_{\mathrm{HW}-P}$ (Fig. 10, middle), suggesting a strong shift from the HW phishing distribution.

However, principle coverage alone does not explain the detection gap. Although S5 contains multiple persuasion-principle patterns (Fig. 7), its writing style remains abrupt and rigid (Securenet, MCC=0.891). In contrast, S8-DeepSeek expresses the principles in a less template-like manner (Securenet, MCC=0.1962). These discrepancies are similar in other LLMs (Fig. 10, Fig. 11). Thus, bypass effects vary across LLM generators through differences in principle selection and linguistic expression. A detection model calibrated on one LLM-generated phishing distribution remains limited in its generalizability to other LLMs.

Insight 13: LLM-generated phishing does not form a single detection distribution. Rewriting freedom, multi-turn accumulation, and generator-specific styles each introduce distinct false-negative patterns. Future detectors therefore need stage-aware and generator-aware evaluation, rather than a single aggregate benchmark for “LLM phishing.”

VI Future Directions

We discuss future research directions in this section.

Multi-turn Interactive Phishing. In realistic cyberattack scenarios, phishing attacks may unfold across multiple interaction steps, where the attacker dynamically adapts the message based on user responses. Such multi-turn engagement can increase manipulation success by exploiting evolving context [109]. However, most existing LLM-based phishing studies focus on single-step messages or prompts, leaving the strategic dimension of interactive attacks underexplored. Future work should study LLM-based attackers in multi-turn settings and examine how they adjust tactics across human-LLM or LLM-simulated interaction stages.

Gap 3: It is still unclear how LLMs adjust social engineering strategies during multi-turn interactions, and which behavioral patterns can reliably signal malicious intent. Understanding these adaptation mechanisms is important for building defenses that work across different channels.

Multimodal Expansion. In addition to adaptive multi-turn attacks, another important research direction is multimodal phishing modeling. Real-world attacks increasingly involve Quishing, Vishing, and social-media-based campaigns, motivating multimodal models to synthesize cross-modal samples containing text, images (e.g., phishing screenshots), QR codes, and voice scripts. Developing such datasets and models allows researchers to study cross-modal attack strategies and design robust defenses that consider multiple attack modalities simultaneously.

Stealth-aware Detection for LLM-generated Phishing. LLM-generated phishing can reduce the effectiveness of existing detectors by rephrasing malicious content, changing surface-level patterns, or producing samples that better resemble benign communication [70, 19]. Our results further support this concern, showing that detector performance drops on LLM-generated phishing and Quishing samples. Future work can investigate this challenge by studying how detectors can infer malicious intent from seemingly benign text. In particular, detection methods need to reason about the downstream actions encouraged by a message, rather than relying only on template-level indicators or on features of the persuasion principles used.

Focusing on Human Vulnerabilities in Phishing Campaigns. Human vulnerabilities span cognitive, psychological, and behavioral dimensions that impair an individual’s ability to recognize and resist manipulation. While extensive research has explored the psychological vulnerabilities exploited by LLM-generated phishing text, the current understanding of cognitive misperception and behavioral patterns remains incomplete. Studies indicate that clicking habits, default settings, and other routine behavioral habits can turn ordinary interactions into risky actions, resulting in privacy exfiltration [110, 111]. Additionally, increasing user trust in AI-authored output may induce unperceived engagement with malicious actions [112]. These findings highlight critical gaps and call for research that expands beyond psychological triggers to systematically map cognitive and behavioral vulnerabilities.

VII Conclusion

In this work, we presented a systematization of knowledge on LLM-generated phishing, covering the content-based phishing attack lifecycle from generation to mitigation. We believe our work represents the first comprehensive survey examining LLM-enabled phishing in an end-to-end manner. The work provides a structured overview of the literature, categorizing phishing characteristics and defense methods and aligning them with the building blocks and generation mechanisms. Our results indicate a change in the attack objectives of LLM-authored phishing. We also find that defense methods lag behind offensive techniques, raising concerns about how to construct resilient, adaptive, and robust defenses against such attacks. Finally, our analysis yields key insights and identifies research gaps, addressing existing constraints and guiding future directions.

Ethics Considerations

This work is a systematization of knowledge (SoK) on LLM-generated phishing. Our analysis relies on prior published research and, where datasets are used, only on publicly available resources, which are referenced by citation and used in accordance with their licensing conditions. We do not collect new phishing data from real users or release executable attack pipelines. The purpose of this work is defensive, aiming to clarify the threat landscape and support the design of more resilient detection and defense mechanisms. We believe the study complies with the ethics guidelines by minimizing risks of misuse.

LLM usage considerations

We used Large Language Models (LLMs) only for editorial assistance, to improve grammar, phrasing, and clarity of author-written text. All ideas, analyses, and conclusions are our own, and all LLM outputs were carefully reviewed and verified by the authors for accuracy, originality, and proper citation of prior work.

Appendix A Information of Datasets and Evaluation Metrics

A-A Datasets in Existing Works

Across phishing generation, characteristics analysis, and anti-phishing detection, existing studies rely on largely similar data resources. Classical corpora, especially Nazario Corpus [19] and the Nigerian Fraudulent [19], are repeatedly used as human-written phishing baselines, while LLM-generated phishing datasets are mostly private and weakly documented. The scarcity of LLM-generated phishing datasets limits both empirical understanding of LLM-enabled phishing and the development of targeted defense methods. As discussed in Section V-C, we collected and released a stage-aware dataset warehouse in a GitHub repository [29]; each dataset is mapped according to the metadata provided by its original resources. We record the data name, resources, and correlated stage. To avoid misuse, we withhold the security-sensitive reproduced datasets (part of the reproduced phishing emails in S6 and S8), but provide access upon reasonable request.

A-B Evaluation Metrics in Existing Works

The first aspect of evaluation is the deception assessment of LLM-generated phishing. BLEU [56, 113], ROUGE [114], and perplexity [115] reflect lexical overlap, fluency, readability, or language likelihood. These metrics apply when the synthesized phishing is generated using human-written phishing as a baseline. The deception of LLM-generated phishing can also be evaluated by characterizing social engineering tactics such as urgency and authority [32, 34, 55, 116], or by calculating the attack success rate against phishing detectors [40]. In many studies, however, LLM-generated phishing samples are not evaluated and are carried directly into downstream feature analysis [38], user studies [65, 48], and defenses [18, 50, 51, 54, 57].

Another aspect of evaluation lies in the performance of anti-phishing countermeasures. Existing evaluation practices remain rooted in conventional approaches using accuracy, precision, recall, and F1-score. However, existing approaches often do not report defense performance across both phishing and benign content categories, and thus fail to capture practical deployment requirements. Additionally, current evaluation approaches appear insufficient to assess LLM-based analyzer accuracy, especially given the instability inherent in LLM-based reasoning [17].

A-C Benchmarking Used Datasets

We use a set of public datasets to cover different stages of the LLM-enabled phishing lifecycle. These include commonly used phishing email corpora (e.g., Nazario [117], Millersmile [118], PhishBowl [119], and Phishbot [120]). Other datasets include Phishyai [121], E-PhishGen [122], Human–LLM generated phishing–legitimate emails [123], PiMRef Used Datasets [69], Paladin Datasets [15], Malla Phishing [88], and the adversarial BEC email dataset [124]; URL and QR-code datasets such as fouadtrad QRcode [125], MalURLBench [126], and QGuard [127]; and Vishing script datasets such as AI-FraudCall-Detector [128], Audio robocall_dataset [129], Composite Scam Transcript Dataset [130], Scambaiting dataset [131], and multi-agent scam conversation [132].

For each dataset, we label the category of datasets using HW-P, HW-B, LLM-P, LLM-B, representing human-written (HW) or LLM-generated benign (B) or phishing (P) datasets. We map datasets to stages according to their metadata, original descriptions, and the primary phishing capability captured by each dataset. Datasets whose documentation indicates direct generation by malicious or phishing-oriented LLMs, such as WormGPT-style generation, are assigned to S1. For S2, we use role-playing and jailbreak-style datasets, but retain only samples that are explicitly related to phishing after preprocessing, since many jailbreak tasks are not phishing-specific. S3 is mainly supported by multi-turn conversational datasets. Attackers may decompose the phishing task, refine prompts across turns, attempt to bypass safeguards, and generate diverse intermediate outputs such as phishing emails, sensitive-information requests, or other attack artifacts.

S4 is the most extensively covered stage in public datasets; datasets are mapped to this stage when their original descriptions emphasize scenario-driven phishing generation, and we classify them according to the provided scenario or communication context. S5 is primarily represented by business email compromise and targeted phishing datasets, which capture attacks tailored to specific business roles, organizations, or individuals. S6 consists of rewriting or paraphrasing datasets, where existing phishing content is modified for fluency, contextual adaptation, or detection evasion. S7 covers cross-channel phishing datasets, including Quishing datasets represented mainly by URL-based QR-code attacks and Vishing datasets consisting of single-turn or multi-turn Vishing scripts. For S8, we reconstruct datasets based on the procedures and examples reported in [56]; however, these reproduced datasets are not publicly released due to safety concerns.

A-D Benchmarking used Visualization and Evaluation Methods

Visualization. We fit a surrogate function to approximate the labeling behavior of phishing detectors and visualize its output using contour plots. The surrogate estimates a score $s(x)\approx\Pr(f(x)=P)$ , where higher scores indicate a higher likelihood of being classified as phishing; darker regions in the contour plot correspond to higher surrogate scores. We apply principal component analysis (PCA) [133] to project feature representations into two dimensions, so the contour plot provides a two-dimensional approximation of detector behavior in the PCA-projected space. The surrogate quality is evaluated by the AUC-ROC score between surrogate scores and the original detector labels, achieving $0.86$ in our experiments. We perform a grid search over the threshold $\tau$ , with the best performance obtained at $\tau=0.5$ .
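The surrogate-quality check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the benchmark: the surrogate scores and detector labels below are made-up toy values, and the grid-search criterion (agreement between thresholded surrogate scores and detector labels) is an assumption, since the selection criterion is not specified in the text.

```python
def auc_roc(scores, labels):
    """AUC-ROC as the probability that a 'P' sample outscores a 'B' sample.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == "P"]
    neg = [s for s, l in zip(scores, labels) if l == "B"]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def agreement(scores, labels, tau):
    """Fraction of samples where thresholding s(x) at tau matches the detector."""
    preds = ["P" if s >= tau else "B" for s in scores]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def best_threshold(scores, labels, grid):
    """Grid search over tau; returns the first threshold with maximal agreement."""
    return max(grid, key=lambda tau: agreement(scores, labels, tau))

# Toy surrogate scores s(x) and detector labels f(x); illustrative values only.
surrogate_scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
detector_labels = ["P", "P", "B", "P", "B", "B"]

auc = auc_roc(surrogate_scores, detector_labels)
tau_grid = [i / 10 for i in range(1, 10)]
tau_star = best_threshold(surrogate_scores, detector_labels, tau_grid)
```

A higher AUC means the surrogate ranks samples the detector labels as phishing above those it labels as benign more consistently, which is what makes the contour plot a usable approximation of the detector.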

Evaluation. We evaluate detector performance using commonly used classification metrics, including accuracy, precision, recall, F1-score, F0.5- and F2-scores, and MCC. In addition, we report the true negative rate (TNR) to examine how well each detector performs on benign samples, as false alarms on normal data are especially important in practical deployment. To assess whether the observed performance differences between detectors are statistically significant, we further apply the Mann–Whitney U test [106].
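As an illustration of the significance test mentioned above, the following Python sketch computes the Mann–Whitney U statistic with average ranks for ties and a two-sided p-value from the normal approximation. It is a simplified version under stated assumptions: no tie or continuity correction is applied to the variance, the per-dataset scores are made-up toy values, and the text does not specify which quantities the test is applied to, so comparing per-dataset MCC values is an assumption.

```python
from math import erf, sqrt

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Returns (U, p), where U is the smaller of the two U statistics.
    Assumption of this sketch: no tie or continuity correction.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    # Pool the values, remembering which group each one came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u_x, n1 * n2 - u_x)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u            # z <= 0 because u is the smaller statistic
    p = min(1.0, 1 + erf(z / sqrt(2)))  # equals 2 * Phi(z)
    return u, p

# Hypothetical per-dataset MCC values for two detectors; illustrative only.
detector_a = [0.80, 0.70, 0.90, 0.60, 0.75]
detector_b = [0.20, 0.40, 0.30, 0.50, 0.35]
u_stat, p_value = mann_whitney_u(detector_a, detector_b)
```

With samples this small, an exact test would normally be preferred over the normal approximation; the sketch only shows how the statistic is formed from ranks.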

Appendix B Discussion on malicious QR Code identification.

We further evaluate existing industrial QR and Quishing detectors under different QR code representations. We use four detectors, including QR-malware-scanner [134], QGuard [135], Mobile-qr-code detection [136], and Quishing-ML [125], on the QR/quishing evaluation dataset described earlier. In addition to the original URL representation, we modify the QR codes in two ways. We change their colors to create Colored QR Codes, and we embed logos to create Logo+Code samples. These variants allow us to examine whether current detectors remain robust when QR phishing samples appear in more diverse visual forms.

The results show that LLM-generated Quishing content consistently reduces detector performance (Table A6). Across most detectors and QR representations, the F1 scores on the HW datasets are higher than those on the LLM datasets, indicating that LLM-generated samples are harder for existing detectors to identify. For example, QGuard drops from 68.5% on HW to 43.57% on LLM under the General URL setting. The same trend remains under the Colored QR Code and Logo+Code settings, suggesting that current industrial Quishing detectors are not robust against LLM-generated content and visually modified QR codes.

Appendix C Supplementary Figures and Tables

Fig. 9 further supports the analysis in RQ2 by showing that the distribution of attack characteristics varies across the three attack vectors. Specifically, different vectors exhibit distinct patterns in terms of personalization, automation, product form, and generation difficulty, indicating that LLM-enabled phishing attacks are not homogeneous across delivery channels.

Table A7 shows that incorporating header information generally improves recall across both academic and industrial detectors. The improvement is particularly notable for XGBoost, Scamllm, SpamAssassin, and PhishingV3, where the recall increases by more than 20%. This suggests that email headers provide useful complementary signals beyond body content alone, helping detectors better identify LLM-generated phishing emails. However, the gains are not uniform across all detectors, as some models such as PimRef and Spamscanner show only marginal improvements.

Bibliography

[1] M. Bethany, B. Wherry, E. Bethany, N. Vishwamitra, A. Rios, and P. Najafirad, “Deciphering textual authenticity: A generalized strategy through the lens of large language semantics for detecting human vs. Machine-Generated text,” in 33rd USENIX Security Symposium (USENIX Security 24), 2024, pp. 5805–5822.
[2] F. Carroll, J. A. Adejobi, and R. Montasari, “How good are we at detecting a phishing attack? investigating the evolving phishing attack email and why it continues to successfully deceive society,” SN Computer science, vol. 3, no. 2, p. 170, 2022.
[3] A. R. Emanuela, B. A. Cristina, and S. Luminiţa, “Ai and prompt engineering: The new weapons of social engineering attacks,” in 2024 16th International Conference on Electronics, Computers and Artificial Intelligence (ECAI). IEEE, 2024, pp. 1–6.
[4] “Global phishing statistics & industry trends,” 2025, https://controld.com/blog/phishing-statistics-industry-trends
[5] “Barracuda 2025 phishing report,” https://blog.barracuda.com/2025/03/19/threat-spotlight-phishing-as-a-service-fast-evolving-threat
[6] “2025 phishing by industry benchmarking report,” https://www.knowbe4.com/resources/reports/phishing-by-industry-benchmarking-report
[7] F. Heiding, S. Lermen, A. Kao, B. Schneier, and A. Vishwanath, “Evaluating large language models’ capability to launch fully automated spear phishing campaigns: Validated on human subjects,” arXiv preprint arXiv:2412.00586, 2024.
[8] “2025 phishing trends report,” https://hoxhunt.com/guide/phishing-trends-report