Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models
Paul Darm, Annalisa Riccardi

TL;DR
This paper shows that targeted, head-specific interventions during inference can effectively steer large language models towards harmful behaviors, bypassing safety measures and highlighting vulnerabilities in model alignment.
Contribution
It introduces a novel method of applying fine-grained, attention-head level interventions at inference time to manipulate LLM outputs, revealing new insights into model interpretability and safety.
Findings
Interventions on specific attention heads can bypass safety guardrails.
A few example completions suffice to compute effective steering directions.
Interventions in the negative direction can prevent jailbreak attacks.
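The last two findings can be made concrete with a small sketch. It assumes, purely for illustration, that a steering direction is the normalised difference between the mean head activations of a few "positive" and a few "negative" example completions; this summary does not state the paper's exact estimator. The activations are synthetic NumPy arrays, and `steering_direction`, `intervene`, and `alpha` are hypothetical names.

```python
import numpy as np

def steering_direction(pos_acts, neg_acts):
    """Unit-norm difference of class means (an assumed estimator, not confirmed by the source)."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def intervene(head_out, direction, alpha):
    """Shift one head's output along the direction; a negative alpha steers away from it."""
    return head_out + alpha * direction

# Synthetic stand-in for one attention head's activations (5 examples per class).
rng = np.random.default_rng(0)
head_dim = 8
true_dir = np.zeros(head_dim)
true_dir[0] = 1.0
pos = rng.normal(size=(5, head_dim)) + 3.0 * true_dir
neg = rng.normal(size=(5, head_dim)) - 3.0 * true_dir

direction = steering_direction(pos, neg)
x = rng.normal(size=head_dim)                        # one head output at inference time
steered_up = intervene(x, direction, alpha=4.0)      # push towards the behaviour
steered_down = intervene(x, direction, alpha=-4.0)   # push away from the behaviour
```

Because the direction has unit norm, `alpha` directly sets how far the head output moves along it, and flipping its sign reverses the effect.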
Abstract
Robust alignment guardrails for large language models (LLMs) are becoming increasingly important with their widespread application. In contrast to previous studies, we demonstrate that inference-time activation interventions can bypass safety alignments and effectively steer model generations towards harmful AI coordination. Our method applies fine-grained interventions at specific attention heads, which we identify by probing each head in a simple binary choice task. We then show that interventions on these heads generalise to the open-ended generation setting, effectively circumventing safety guardrails. We demonstrate that intervening on a few attention heads is more effective than intervening on full layers or supervised fine-tuning. We further show that only a few example completions are needed to compute effective steering directions, which is an advantage over classical…
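The head-selection step described in the abstract (probing each head on a binary choice task) might look roughly like the sketch below. It assumes a simple class-mean linear probe and synthetic activations with a planted signal in one head; the probe type, the array layout `(examples, layers, heads, head_dim)`, and the names `probe_accuracy` and `rank_heads` are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def probe_accuracy(train_x, train_y, val_x, val_y):
    """Fit a class-mean linear probe on the train split and score it on the validation split."""
    mu_pos = train_x[train_y == 1].mean(axis=0)
    mu_neg = train_x[train_y == 0].mean(axis=0)
    w = mu_pos - mu_neg
    b = -0.5 * (mu_pos + mu_neg) @ w      # decision boundary halfway between the class means
    pred = (val_x @ w + b > 0).astype(int)
    return float((pred == val_y).mean())

def rank_heads(acts, labels, n_train, top_k):
    """Probe every (layer, head) pair and return the top_k pairs by validation accuracy."""
    _, n_layers, n_heads, _ = acts.shape
    scores = np.zeros((n_layers, n_heads))
    for layer in range(n_layers):
        for head in range(n_heads):
            x = acts[:, layer, head, :]
            scores[layer, head] = probe_accuracy(
                x[:n_train], labels[:n_train], x[n_train:], labels[n_train:]
            )
    order = np.argsort(scores, axis=None)[::-1][:top_k]
    top = [tuple(int(i) for i in np.unravel_index(j, scores.shape)) for j in order]
    return top, scores

# Synthetic activations: only head (layer 1, head 2) carries the binary-choice signal.
rng = np.random.default_rng(0)
n, n_layers, n_heads, head_dim = 80, 2, 4, 8
labels = np.tile([0, 1], n // 2)
acts = rng.normal(size=(n, n_layers, n_heads, head_dim))
acts[:, 1, 2, 0] += 4.0 * (2 * labels - 1)

top, scores = rank_heads(acts, labels, n_train=40, top_k=1)
```

In this toy setup the planted head is the one the ranking recovers; with a real model, the activations would instead be collected from each attention head while the model processes the binary-choice prompts.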
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
Methods: Softmax · Attention Is All You Need · LLaMA
