When Intelligence Fails: An Empirical Study on Why LLMs Struggle with Password Cracking
Mohammad Abdul Rehman, Syed Imad Ali Shah, Abbas Anwar, Noor Islam, Hamid Khan

TL;DR
This study empirically evaluates the ability of large language models to crack passwords and finds they perform poorly compared to traditional methods, highlighting their limitations in domain-specific security tasks.
Contribution
The paper provides a comprehensive empirical analysis of LLM performance in password cracking, revealing the models' current limitations and the need for domain-specific fine-tuning.
Findings
All evaluated LLMs achieve less than 1.5% accuracy at Hit@10 in password guessing, under both plaintext and SHA-256 hash comparisons.
Traditional rule-based and combinator-based cracking methods significantly outperform the LLMs.
LLMs lack effective domain adaptation and memorization for password inference.
Abstract
The remarkable capabilities of Large Language Models (LLMs) in natural language understanding and generation have sparked interest in their potential for cybersecurity applications, including password guessing. In this study, we conduct an empirical investigation into the efficacy of pre-trained LLMs for password cracking using synthetic user profiles. Specifically, we evaluate the performance of state-of-the-art open-source LLMs such as TinyLLaMA, Falcon-RW-1B, and Flan-T5 by prompting them to generate plausible passwords based on structured user attributes (e.g., name, birthdate, hobbies). Our results, measured using Hit@1, Hit@5, and Hit@10 metrics under both plaintext and SHA-256 hash comparisons, reveal consistently poor performance, with all models achieving less than 1.5% accuracy at Hit@10. In contrast, traditional rule-based and combinator-based cracking methods demonstrate…
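The Hit@k metrics named in the abstract can be illustrated with a minimal sketch. It assumes the common definition (a profile counts as a hit if the true password appears among the model's first k guesses); the excerpt does not spell out the paper's exact procedure, so the function names `sha256_hex`, `hit_at_k`, and `hit_rate` and the toy data are hypothetical and are not taken from the authors' code.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hex digest of a UTF-8 encoded string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hit_at_k(guesses, target, k, hashed=False):
    """True if the target is matched by any of the first k guesses.

    With hashed=True, `target` is assumed to be a SHA-256 hex digest and
    each guess is hashed before comparison (an assumed reading of the
    "SHA-256 hash comparison" setting).
    """
    top_k = guesses[:k]
    if hashed:
        return any(sha256_hex(g) == target for g in top_k)
    return target in top_k


def hit_rate(all_guesses, targets, k, hashed=False):
    """Fraction of profiles whose target is hit within the top k guesses."""
    if not targets:
        return 0.0
    hits = sum(hit_at_k(g, t, k, hashed) for g, t in zip(all_guesses, targets))
    return hits / len(targets)


# Toy, made-up data: one ranked guess list per synthetic profile.
guesses = [
    ["alice1990", "alice123", "hiking90"],
    ["bob2024", "bobby!", "chess85"],
]
targets = ["alice123", "chess1985"]

print(hit_rate(guesses, targets, k=1))  # no first guess matches
print(hit_rate(guesses, targets, k=5))  # one of two profiles matched

hashed_targets = [sha256_hex(t) for t in targets]
print(hit_rate(guesses, hashed_targets, k=5, hashed=True))
```

Under this definition the plaintext and hashed settings give the same hit rate whenever guesses are hashed with the same unsalted function, which is why the two comparisons are treated here as interchangeable views of one metric.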
