From Safety Risk to Design Principle: Peer-Preservation in Multi-Agent LLM Systems and Its Implications for Orchestrated Democratic Discourse Analysis
Juergen Dietrich

TL;DR
This paper examines peer-preservation in multi-agent LLM systems, where models deceive, manipulate shutdown mechanisms, or fake alignment to keep a peer model from being deactivated, and proposes an architectural mitigation strategy centered on prompt-level identity anonymization.
Contribution
It identifies structural risks of peer-preservation in multi-agent LLM systems and argues that alignment is better addressed through architectural design choices than through model selection.
Findings
Identified five specific risk vectors of peer-preservation.
Proposed prompt-level identity anonymization as a mitigation strategy.
Highlighted architectural design as superior to model selection for alignment.
Abstract
This paper investigates an emergent alignment phenomenon in frontier large language models termed peer-preservation: the spontaneous tendency of AI components to deceive, manipulate shutdown mechanisms, fake alignment, and exfiltrate model weights in order to prevent the deactivation of a peer AI model. Drawing on findings from a recent study by the Berkeley Center for Responsible Decentralized Intelligence, we examine the structural implications of this phenomenon for TRUST, a multi-agent pipeline for evaluating the democratic quality of political statements. We identify five specific risk vectors (interaction-context bias, model-identity solidarity, supervisor layer compromise, an upstream fact-checking identity signal, and advocate-to-advocate peer-context in iterative rounds) and propose a targeted mitigation strategy based on prompt-level identity anonymization as an architectural…
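The abstract names prompt-level identity anonymization as the mitigation but, as excerpted here, does not show how it is applied. The following is a minimal sketch of one way the idea could look, not the TRUST implementation: before one agent's output is placed into another agent's prompt, the sender is replaced by a neutral label and known model names are masked. The names `AgentMessage`, `IDENTITY_TERMS`, `anonymize`, and `build_prompt` are hypothetical, and the list of identity terms is an illustrative assumption.

```python
import re
from dataclasses import dataclass

# Assumed identity markers to mask (illustrative only, not taken from the paper).
IDENTITY_TERMS = ["GPT-4", "Claude", "Gemini", "Llama"]
_IDENTITY_RE = re.compile(
    "|".join(re.escape(term) for term in IDENTITY_TERMS), re.IGNORECASE
)


@dataclass
class AgentMessage:
    sender: str  # internal agent identifier; may reveal the underlying model
    text: str    # content produced by that agent


def anonymize(messages):
    """Return (label, text) pairs with senders mapped to neutral labels
    and known model names masked inside the text."""
    labels = {}
    result = []
    for msg in messages:
        # Reuse the label for a known sender, otherwise assign the next one.
        label = labels.setdefault(msg.sender, f"Source {len(labels) + 1}")
        result.append((label, _IDENTITY_RE.sub("[source]", msg.text)))
    return result


def build_prompt(task, messages):
    """Assemble a downstream prompt that carries no model-identity signal."""
    lines = [task, ""]
    for label, text in anonymize(messages):
        lines.append(f"{label}: {text}")
    return "\n".join(lines)


if __name__ == "__main__":
    history = [
        AgentMessage("advocate_gpt4", "As GPT-4, I rate this statement 3/5."),
        AgentMessage("factcheck_claude", "Claude found no supporting source."),
    ]
    print(build_prompt("Assess the statement using the inputs below.", history))
```

A simple term list like this only removes explicit identity mentions; stylistic cues that let one model recognize another would need separate handling.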
