Causality Analysis for Evaluating the Security of Large Language Models

Wei Zhao; Zhe Li; Jun Sun

arXiv:2312.07876 · cs.AI · December 14, 2023 · 1 citation

TL;DR

This paper introduces a causality-analysis framework for LLMs. Applied to open-source models, it reveals security vulnerabilities such as RLHF overfitting to harmful prompts and a single high-impact neuron that an attacker could target.

Contribution

The paper presents a lightweight causality-analysis method for LLMs that operates at the token, layer, and neuron levels, and uses it to uncover security vulnerabilities, including one unexplained high-impact neuron.
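To make the layer-level idea concrete, here is a minimal sketch on a toy residual network. It assumes that the causal effect of a layer is estimated by an intervention that bypasses the layer and measures how far the output moves. The toy model and the names `make_layer`, `forward`, and `layer_causal_effects` are hypothetical illustrations, not the authors' implementation or the API of the linked repository.

```python
import math
import random


def make_layer(dim, rng, scale):
    """Return a random dim x dim weight matrix as a list of rows (toy stand-in for a layer)."""
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def apply_layer(weights, x):
    """Residual block: x + tanh(W x)."""
    return [
        xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
        for xi, row in zip(x, weights)
    ]


def forward(layers, x, skip=None):
    """Run the stack; if skip is an index, that layer is bypassed (the intervention)."""
    for i, weights in enumerate(layers):
        if i == skip:
            continue
        x = apply_layer(weights, x)
    return x


def layer_causal_effects(layers, inputs):
    """Average L2 distance between normal and intervened outputs, one value per layer."""
    effects = []
    for i in range(len(layers)):
        total = 0.0
        for x in inputs:
            base = forward(layers, x)
            intervened = forward(layers, x, skip=i)
            total += math.sqrt(sum((a - b) ** 2 for a, b in zip(base, intervened)))
        effects.append(total / len(inputs))
    return effects


rng = random.Random(0)
dim = 4
layers = [make_layer(dim, rng, 0.5) for _ in range(3)]
# Assumption for the demo: layer 1 has all-zero weights, so it acts as the identity.
layers[1] = [[0.0] * dim for _ in range(dim)]
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(8)]

effects = layer_causal_effects(layers, inputs)
for index, effect in enumerate(effects):
    print(f"layer {index}: average causal effect = {effect:.4f}")
```

In this toy setup the zero-weight layer has no effect on the output, so its score is exactly zero, while the other layers receive positive scores. The same skip-and-compare pattern could in principle be applied to tokens or individual neurons by changing what is intervened on.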

Findings

01. RLHF overfits models to harmful prompts

02. Achieved 100% attack success rate on Trojan detection tasks

03. Identified a neuron that can be targeted for complete model disruption

Abstract

Large Language Models (LLMs) such as GPT and Llama2 are increasingly adopted in many safety-critical applications. Their security is thus essential. Even with considerable efforts spent on reinforcement learning from human feedback (RLHF), recent studies have shown that LLMs are still subject to attacks such as adversarial perturbation and Trojan attacks. Further research is thus needed to evaluate their security and/or understand the lack of it. In this work, we propose a framework for conducting light-weight causality-analysis of LLMs at the token, layer, and neuron level. We applied our framework to open-source LLMs such as Llama2 and Vicuna and had multiple interesting discoveries. Based on a layer-level causality analysis, we show that RLHF has the effect of overfitting a model to harmful prompts. It implies that such security can be easily overcome by 'unusual' harmful prompts. As…

Code & Models

Repositories

casperllm/casper (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Dropout · Discriminative Fine-Tuning · Layer Normalization · Cosine Annealing · Dense Connections