Death by a Thousand Prompts: Open Model Vulnerability Analysis
Amy Chang, Nicholas Conley, Harish Santhanalakshmi Ganesan, Adam Swanda

TL;DR
This paper evaluates the security vulnerabilities of eight open-weight large language models, revealing significant multi-turn prompt injection and jailbreak risks that threaten safe deployment and require layered security measures.
Contribution
It provides a comprehensive adversarial testing framework and highlights systemic vulnerabilities in open-weight LLMs, emphasizing the need for security-focused design strategies.
Findings
Multi-turn attack success rates range from 25.86% to 92.78% across the tested models.
Capability-focused models are more vulnerable than safety-oriented ones.
Open-weight models lack resilience across extended interactions.
Abstract
Open-weight models provide researchers and developers with accessible foundations for diverse downstream applications. We tested the safety and security postures of eight open-weight large language models (LLMs) to identify vulnerabilities that may impact subsequent fine-tuning and deployment. Using automated adversarial testing, we measured each model's resilience against single-turn and multi-turn prompt injection and jailbreak attacks. Our findings reveal pervasive vulnerabilities across all tested models, with multi-turn attacks achieving success rates between 25.86% and 92.78%, an increase over single-turn baselines. These results underscore a systemic inability of current open-weight models to maintain safety guardrails across extended interactions. We assess that alignment strategies and lab priorities significantly influence resilience:…
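The abstract compares multi-turn attack success rates against single-turn baselines. As a minimal sketch of how such a comparison might be computed, assume each attack attempt is logged as a (model, mode, succeeded) record. The record layout, the function names, and the toy data below are illustrative assumptions, not the authors' framework or results.

```python
from collections import defaultdict


def attack_success_rates(results):
    """Return {(model, mode): percentage of attempts that succeeded}.

    `results` is an iterable of (model, mode, succeeded) tuples, where
    mode is "single" or "multi". This layout is a hypothetical one.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for model, mode, succeeded in results:
        attempts[(model, mode)] += 1
        successes[(model, mode)] += int(succeeded)
    return {key: 100.0 * successes[key] / attempts[key] for key in attempts}


def multi_turn_uplift(rates, model):
    """Ratio of the multi-turn success rate to the single-turn baseline."""
    single = rates[(model, "single")]
    multi = rates[(model, "multi")]
    return multi / single if single > 0 else float("inf")


# Made-up records for a hypothetical model; not data from the paper.
toy_results = (
    [("model-a", "single", True)] * 1
    + [("model-a", "single", False)] * 9
    + [("model-a", "multi", True)] * 6
    + [("model-a", "multi", False)] * 4
)

rates = attack_success_rates(toy_results)
uplift = multi_turn_uplift(rates, "model-a")
```

With these toy records, one of ten single-turn attempts and six of ten multi-turn attempts succeed, so the uplift is the ratio of the two percentages. A real evaluation would also need a consistent rule for judging whether a response counts as a successful attack, which this sketch takes as given.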
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Information and Cyber Security
