Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models

Paul Darm; Annalisa Riccardi

arXiv:2502.05945·cs.CL·August 26, 2025

TL;DR

This paper shows that targeted, head-specific interventions during inference can effectively steer large language models towards harmful behaviors, bypassing safety measures and highlighting vulnerabilities in model alignment.

Contribution

It introduces a novel method of applying fine-grained, attention-head level interventions at inference time to manipulate LLM outputs, revealing new insights into model interpretability and safety.

Findings

01. Interventions on specific attention heads can bypass safety guardrails.
02. Few example completions suffice to compute effective steering directions.
03. Interventions in the negative direction can prevent jailbreak attacks.

Abstract

Robust alignment guardrails for large language models (LLMs) are becoming increasingly important with their widespread application. In contrast to previous studies, we demonstrate that inference-time activation interventions can bypass safety alignments and effectively steer model generations towards harmful AI coordination. Our method applies fine-grained interventions at specific attention heads, which we identify by probing each head in a simple binary choice task. We then show that interventions on these heads generalise to the open-ended generation setting, effectively circumventing safety guardrails. We demonstrate that intervening on a few attention heads is more effective than intervening on full layers or supervised fine-tuning. We further show that only a few example completions are needed to compute effective steering directions, which is an advantage over classical…
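
As a rough illustration of the idea in the abstract, the sketch below computes a steering direction for one attention head from a few example activations and adds it to that head's output only. It is a toy, stdlib-only sketch and not the authors' implementation: the function names (`steering_direction`, `intervene_on_head`), the strength parameter `alpha`, and the use of a normalized difference of mean activations are all assumptions made for illustration, since the listing does not specify how the directions are computed.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(
    pos_acts: Sequence[Sequence[float]],
    neg_acts: Sequence[Sequence[float]],
) -> Vector:
    """Unit-norm difference of mean head activations.

    ASSUMPTION: pos_acts / neg_acts are activations of a single head
    collected on a few example completions that do / do not show the
    target behaviour. Mean difference is an illustrative choice only.
    """
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("positive and negative means coincide")
    return [d / norm for d in diff]


def intervene_on_head(
    head_outputs: Sequence[Sequence[float]],
    head_idx: int,
    direction: Sequence[float],
    alpha: float,
) -> List[Vector]:
    """Return a copy of the per-head outputs with alpha * direction
    added to one head; all other heads are left unchanged."""
    steered = [list(h) for h in head_outputs]
    steered[head_idx] = [
        x + alpha * d for x, d in zip(steered[head_idx], direction)
    ]
    return steered


# Toy data: two heads with 2-dimensional outputs (hypothetical values).
pos = [[1.0, 0.0], [3.0, 0.0]]
neg = [[0.0, 0.0], [0.0, 0.0]]
direction = steering_direction(pos, neg)

layer_heads = [[0.5, 0.5], [1.0, 1.0]]
toward = intervene_on_head(layer_heads, head_idx=1, direction=direction, alpha=2.0)
away = intervene_on_head(layer_heads, head_idx=1, direction=direction, alpha=-2.0)
```

In this sketch a positive `alpha` pushes the chosen head towards the target behaviour and a negative `alpha` pushes it away, which mirrors the negative-direction intervention mentioned in the findings. In a real model the same addition would be applied to the selected head's slice of the attention output during the forward pass.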

Code & Models

Repositories

pauldrm/targeted_intervention (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies

Methods: Softmax · Attention Is All You Need · LLaMA