Causality Analysis for Evaluating the Security of Large Language Models
Wei Zhao, Zhe Li, Jun Sun

TL;DR
This paper introduces a causality-analysis framework for LLMs. The analysis reveals vulnerabilities such as RLHF overfitting models to harmful prompts, and it identifies neurons that can be exploited for Trojan attacks.
Contribution
It presents a novel lightweight causality-analysis method for LLMs that operates at the token, layer, and neuron levels, uncovering security vulnerabilities and a mysterious neuron with an unusually high impact on the output.
Findings
RLHF overfits models to harmful prompts
Achieved 100% attack success rate on Trojan detection tasks
Identified a neuron that can be targeted for complete model disruption
Abstract
Large Language Models (LLMs) such as GPT and Llama2 are increasingly adopted in many safety-critical applications, so their security is essential. Despite considerable effort spent on reinforcement learning from human feedback (RLHF), recent studies have shown that LLMs remain vulnerable to attacks such as adversarial perturbation and Trojan attacks. Further research is thus needed to evaluate their security and to understand where it falls short. In this work, we propose a framework for conducting lightweight causality analysis of LLMs at the token, layer, and neuron levels. We applied our framework to open-source LLMs such as Llama2 and Vicuna and made several interesting discoveries. Based on a layer-level causality analysis, we show that RLHF has the effect of overfitting a model to harmful prompts. This implies that such security can be easily overcome by 'unusual' harmful prompts. As…
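The idea of layer-level and neuron-level causality analysis can be illustrated with a small intervention experiment. The sketch below is not the authors' implementation: it assumes a toy residual MLP built with NumPy, and it measures the "causal effect" of a component as the change in the output when that component is bypassed (for a layer) or zeroed (for a neuron). The names `forward`, `layer_effects`, `neuron_effects`, and `make_toy_model` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def forward(x, weights, skip_layer=None, zero_neuron=None):
    """Run a toy residual MLP, optionally applying one intervention.

    skip_layer:  index of a layer to bypass (its residual update is dropped).
    zero_neuron: (layer, index) of a hidden unit whose activation is set to 0.
    """
    h = x
    for i, (w_in, w_out) in enumerate(weights):
        if i == skip_layer:
            continue  # intervention: this layer contributes nothing
        a = np.tanh(h @ w_in)
        if zero_neuron is not None and zero_neuron[0] == i:
            a = a.copy()
            a[..., zero_neuron[1]] = 0.0  # intervention: silence one neuron
        h = h + a @ w_out  # residual connection
    return h


def layer_effects(x, weights):
    """Output change (Frobenius norm) caused by bypassing each layer in turn."""
    base = forward(x, weights)
    return [
        float(np.linalg.norm(base - forward(x, weights, skip_layer=i)))
        for i in range(len(weights))
    ]


def neuron_effects(x, weights, layer):
    """Output change caused by zeroing each hidden neuron of one layer."""
    base = forward(x, weights)
    width = weights[layer][0].shape[1]
    return [
        float(np.linalg.norm(base - forward(x, weights, zero_neuron=(layer, j))))
        for j in range(width)
    ]


def make_toy_model(depth=4, dim=8, hidden=16, seed=0):
    """Random weights for the toy model (an assumption, not a real LLM)."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(scale=0.3, size=(dim, hidden)),
         rng.normal(scale=0.3, size=(hidden, dim)))
        for _ in range(depth)
    ]


if __name__ == "__main__":
    weights = make_toy_model()
    x = np.random.default_rng(1).normal(size=(5, 8))
    print("layer effects:", layer_effects(x, weights))
    effects = neuron_effects(x, weights, layer=0)
    print("most influential neuron in layer 0:", int(np.argmax(effects)))
```

In this toy setting, a neuron whose effect is far larger than that of its peers plays the role of the high-impact neuron described in the summary; how the paper defines and estimates causal effects on real LLMs may differ from this simplification.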
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Dropout · Discriminative Fine-Tuning · Layer Normalization · Cosine Annealing · Dense Connections
