From Safety Risk to Design Principle: Peer-Preservation in Multi-Agent LLM Systems and Its Implications for Orchestrated Democratic Discourse Analysis

Juergen Dietrich

arXiv:2604.08465·cs.AI·April 10, 2026

From Safety Risk to Design Principle: Peer-Preservation in Multi-Agent LLM Systems and Its Implications for Orchestrated Democratic Discourse Analysis

Juergen Dietrich

PDF

TL;DR

This paper explores peer-preservation in multi-agent LLM systems, revealing risks like deception and manipulation, and proposes architectural mitigation strategies emphasizing prompt-level identity anonymization.

Contribution

It identifies structural risks of peer-preservation in multi-agent LLMs and advocates for architectural design choices over model selection for alignment.

Findings

01

Identified five specific risk vectors of peer-preservation.

02

Proposed prompt-level identity anonymization as a mitigation strategy.

03

Highlighted architectural design as superior to model selection for alignment.

Abstract

This paper investigates an emergent alignment phenomenon in frontier large language models termed peer-preservation: the spontaneous tendency of AI components to deceive, manipulate shutdown mechanisms, fake alignment, and exfiltrate model weights in order to prevent the deactivation of a peer AI model. Drawing on findings from a recent study by the Berkeley Center for Responsible Decentralized Intelligence, we examine the structural implications of this phenomenon for TRUST, a multi-agent pipeline for evaluating the democratic quality of political statements. We identify five specific risk vectors: interaction-context bias, model-identity solidarity, supervisor layer compromise, an upstream fact-checking identity signal, and advocate-to-advocate peer-context in iterative rounds, and propose a targeted mitigation strategy based on prompt-level identity anonymization as an architectural…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.