How Can We Effectively Use LLMs for Phishing Detection?: Evaluating the Effectiveness of Large Language Model-based Phishing Detection Models
Fujiao Ji, Doowon Kim

TL;DR
This paper evaluates the effectiveness of large language models for phishing detection, analyzing input modalities, prompt strategies, and temperature settings, and compares their performance to traditional deep learning methods.
Contribution
It provides a comprehensive assessment of LLM-based phishing detection, highlighting optimal input modalities and configurations, and comparing commercial and open-source models.
Findings
Commercial LLMs outperform open-source models in detection accuracy.
Screenshot inputs yield the best brand identification results.
Higher temperature settings decrease model performance.
Abstract
Large language models (LLMs) have emerged as a promising phishing detection mechanism, addressing the limitations of traditional deep learning-based detectors, including poor generalization to previously unseen websites and a lack of interpretability. However, LLMs' effectiveness for phishing detection remains unexplored. This study investigates how to effectively leverage LLMs for phishing detection (including target brand identification) by examining the impact of input modalities (screenshots, logos, HTML, and URLs), temperature settings, and prompt engineering strategies. Using a dataset of 19,131 real-world phishing websites and 243 benign sites, we evaluate seven LLMs -- two commercial models (GPT 4.1 and Gemini 2.0 flash) and five open-source models (Qwen, Llama, Janus, DeepSeek-VL2, and R1) -- alongside two deep learning (DL)-based baselines (PhishIntention and Phishpedia).…
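The dataset described above is heavily skewed toward the phishing class (19,131 phishing sites versus 243 benign ones), so raw accuracy alone can look strong even for a detector that learns nothing. The following is a minimal sketch, not the authors' evaluation code. The `Confusion` class and the "always phishing" baseline are hypothetical illustrations; only the two dataset counts come from the abstract.

```python
from dataclasses import dataclass

# Dataset sizes taken from the abstract.
N_PHISH = 19_131
N_BENIGN = 243


@dataclass
class Confusion:
    """Confusion counts with phishing as the positive class (hypothetical helper)."""
    tp: int  # phishing sites flagged as phishing
    fn: int  # phishing sites missed
    fp: int  # benign sites flagged as phishing
    tn: int  # benign sites correctly passed

    def accuracy(self) -> float:
        total = self.tp + self.fn + self.fp + self.tn
        return (self.tp + self.tn) / total if total else 0.0

    def recall(self) -> float:
        pos = self.tp + self.fn
        return self.tp / pos if pos else 0.0

    def specificity(self) -> float:
        neg = self.tn + self.fp
        return self.tn / neg if neg else 0.0

    def balanced_accuracy(self) -> float:
        # Mean of per-class recall; insensitive to class imbalance.
        return (self.recall() + self.specificity()) / 2


# Assumed trivial baseline: label every site as phishing.
always_phishing = Confusion(tp=N_PHISH, fn=0, fp=N_BENIGN, tn=0)

print(f"accuracy          = {always_phishing.accuracy():.4f}")
print(f"balanced accuracy = {always_phishing.balanced_accuracy():.4f}")
```

Under these counts, the trivial baseline reaches roughly 98.7% accuracy but only 50% balanced accuracy, which is why per-class metrics are worth reading alongside any headline accuracy figure on a dataset with this split.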
Taxonomy
Topics: Spam and Phishing Detection · Misinformation and Its Impacts · Imbalanced Data Classification Techniques
