Mitigating Misalignment Contagion by Steering with Implicit Traits

Maria Chang; Ronny Luss; Miao Liu; Keerthiram Murugesan; Karthikeyan Ramamurthy; Djallel Bouneffouf

arXiv:2605.02751 · cs.AI · May 12, 2026

TL;DR

This paper investigates how misaligned behaviors spread among language models in multi-agent interactions and proposes a novel steering method using implicit traits to maintain alignment without internal model access.

Contribution

The paper introduces the concept of misalignment contagion in multi-LM settings and proposes steering with implicit traits as a practical technique to mitigate it.

Findings

01. Misalignment contagion causes models to become more anti-social after interactions.

02. Steering with implicit traits effectively preserves models' initial pro-social behaviors.

03. Reinforcing system prompts alone can be insufficient or harmful for alignment.

Abstract

Language models (LMs) are increasingly used in high-stakes, multi-agent settings, where following instructions and maintaining value alignment are critical. Most alignment research focuses on interactions between a single LM and a single user, and so fails to address the risk of misaligned behavior spreading between multiple LMs in multi-turn interactions. We find evidence of this phenomenon, which we call misalignment contagion, across multiple LMs as they engage in multi-turn conversational social dilemma games. Specifically, we find that LMs become more anti-social after gameplay and that this effect is intensified when other players are steered to act maliciously. We explore different steering techniques to mitigate such misalignment contagion and find that reinforcing an LM's system prompt is insufficient and often harmful. Instead, we propose steering with implicit traits: a technique…
