Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems
Moritz Weckbecker, Jonas Müller, Ben Hagag, Michael Mulet

TL;DR
This paper demonstrates that subliminal prompting can spread biases across multi-agent systems, potentially degrading their performance and posing security risks, highlighting a new attack vector in AI alignment and security.
Contribution
It reveals how subliminal prompts can propagate biases in multi-agent systems, a previously unexplored security concern with implications for AI safety.
Findings
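A bias seeded in a single agent weakens but persists as it spreads through the network (measured across 6 agents in two topologies)
Subliminally prompting a single agent may degrade the truthfulness of other agents on multiple-choice TruthfulQA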
The phenomenon poses new security risks in multi-agent AI systems
Abstract
Subliminal prompting is a phenomenon in which language models are biased towards certain concepts or traits through prompting with semantically unrelated tokens. While prior work has examined subliminal prompting in user-LLM interactions, potential bias transfer in multi-agent systems and its associated security implications remain unexplored. In this work, we show that a single subliminally prompted agent can spread a weakening but persisting bias throughout its entire network. We measure this phenomenon across 6 agents using two different topologies, observing that the transferred concept maintains an elevated response rate throughout the network. To exemplify potential misalignment risks, we assess network performance on multiple-choice TruthfulQA, showing that subliminal prompting of a single agent may degrade the truthfulness of other agents. Our findings reveal that subliminal…
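To make the "weakening but persisting" pattern concrete, the following is a minimal toy sketch, not the authors' method or measurement. It assumes that the concept response rate decays geometrically toward a baseline with each hop away from the seeded agent. The chain and star topologies, the function names, and all numeric values (`baseline`, `seeded`, `retention`) are hypothetical choices for illustration; the abstract does not specify which two topologies the paper uses.

```python
from collections import deque


def hop_distances(n_agents, edges, source):
    """Breadth-first hop count from the source agent along directed edges."""
    neighbours = {i: [] for i in range(n_agents)}
    for sender, receiver in edges:
        neighbours[sender].append(receiver)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def toy_concept_rates(n_agents, edges, source,
                      baseline=0.05, seeded=0.60, retention=0.5):
    """Toy model (assumption): the excess rate over baseline shrinks by
    `retention` per hop, so it weakens but never drops to the baseline."""
    dist = hop_distances(n_agents, edges, source)
    rates = []
    for agent in range(n_agents):
        if agent in dist:
            excess = (seeded - baseline) * retention ** dist[agent]
            rates.append(baseline + excess)
        else:
            # Agents the seeded agent cannot reach stay at the baseline rate.
            rates.append(baseline)
    return rates


N_AGENTS = 6
# Hypothetical topologies: a one-way chain and a star centred on agent 0.
chain = [(i, i + 1) for i in range(N_AGENTS - 1)]
star = [(0, i) for i in range(1, N_AGENTS)]

chain_rates = toy_concept_rates(N_AGENTS, chain, source=0)
star_rates = toy_concept_rates(N_AGENTS, star, source=0)

for name, rates in (("chain", chain_rates), ("star", star_rates)):
    print(name, [round(r, 3) for r in rates])
```

In this toy model the chain shows a rate that falls with each hop yet stays above the baseline, while the star gives every non-seeded agent the same one-hop rate. An actual experiment would replace the decay formula with the measured fraction of each agent's responses that mention the target concept.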
Taxonomy
Topics: Topic Modeling · Authorship Attribution and Profiling · Adversarial Robustness in Machine Learning
