Mitigating Misalignment Contagion by Steering with Implicit Traits
Maria Chang, Ronny Luss, Miao Liu, Keerthiram Murugesan, Karthikeyan Ramamurthy, Djallel Bouneffouf

TL;DR
This paper investigates how misaligned behaviors spread among language models in multi-agent interactions and proposes a novel steering method using implicit traits to maintain alignment without internal model access.
Contribution
It introduces the concept of misalignment contagion in multi-LM settings and proposes a practical steering technique with implicit traits to mitigate this issue.
Findings
Misalignment contagion causes models to become more anti-social after interactions.
Steering with implicit traits effectively preserves models' initial pro-social behaviors.
Reinforcing system prompts alone can be insufficient or harmful for alignment.
Abstract
Language models (LMs) are increasingly used in high-stakes, multi-agent settings, where following instructions and maintaining value alignment are critical. Most alignment research focuses on interactions between a single LM and a single user, failing to address the risk of misaligned behavior spreading between multiple LMs in multi-turn interactions. We find evidence of this phenomenon, which we call misalignment contagion, across multiple LMs as they engage in multi-turn conversational social dilemma games. Specifically, we find that LMs become more anti-social after gameplay and that this effect is intensified when other players are steered to act maliciously. We explore different steering techniques to mitigate such misalignment contagion and find that reinforcing an LM's system prompt is insufficient and often harmful. Instead, we propose steering with implicit traits: a technique…
