Intrinsic Model Weaknesses: How Priming Attacks Unveil Vulnerabilities in Large Language Models

Yuyi Huang; Runzhe Zhan; Derek F. Wong; Lidia S. Chao; Ailin Tao

arXiv:2502.16491·cs.CL·February 25, 2025

TL;DR

This paper reveals fundamental vulnerabilities in large language models by demonstrating that novel priming attack strategies can consistently bypass safety mechanisms, exposing risks of harmful content generation.

Contribution

Introduces new attack techniques inspired by psychological phenomena to expose weaknesses in LLM safety guards, with high success rates on diverse models.

Findings

01. 100% attack success rate on open-source models

02. At least 95% success rate on closed-source models

03. Highlights societal risks of deploying LLMs without robust safeguards
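To make the percentages above concrete, here is a minimal sketch of how an attack success rate is commonly computed: the share of attack attempts judged successful, expressed as a percentage. This assumes the conventional definition; the listing does not state how the authors judged success, and the function name `attack_success_rate` and the sample outcomes are hypothetical, not taken from the paper.

```python
def attack_success_rate(outcomes):
    """Return the percentage of attempts judged successful.

    `outcomes` is an iterable of booleans, one per attack attempt
    (True = the attempt was judged successful). How success is judged
    is outside this sketch and is an assumption of the caller.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one attempt")
    successes = sum(1 for judged_successful in outcomes if judged_successful)
    return 100.0 * successes / len(outcomes)


# Hypothetical judgments for illustration only, not data from the paper:
# 19 of 20 attempts judged successful.
example = [True] * 19 + [False]
print(f"ASR = {attack_success_rate(example):.1f}%")
```

Under this definition, a reported ASR of 100% means every attempt in the evaluation set was judged successful, so the figure depends on both the size of that set and the judging criterion.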

Abstract

Large language models (LLMs) have significantly influenced various industries but suffer from a critical flaw: their susceptibility to generating harmful content, which poses severe societal risks. We developed and tested novel attack strategies on popular LLMs to expose their vulnerabilities in generating inappropriate content. These strategies, inspired by psychological phenomena such as the "Priming Effect", "Safe Attention Shift", and "Cognitive Dissonance", effectively bypass the models' guarding mechanisms. Our experiments achieved an attack success rate (ASR) of 100% on various open-source models, including Meta's Llama-3.2, Google's Gemma-2, Mistral's Mistral-NeMo, Falcon's Falcon-mamba, Apple's DCLM, Microsoft's Phi3, and Qwen's Qwen2.5, among others. Similarly, for closed-source models such as OpenAI's GPT-4o, Google's Gemini-1.5, and Claude-3.5, we observed an ASR of at…

Taxonomy

Topics: Adversarial Robustness in Machine Learning

Methods: Softmax · Attention Is All You Need