Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems

Ziyuan Yang; Wenxuan Ding; Shangbin Feng; Yulia Tsvetkov

arXiv:2602.05176 · cs.CL · February 6, 2026

TL;DR

This paper investigates the risks posed by malicious models in collaborative language model systems, quantifies their impact, and proposes mitigation strategies that significantly reduce their influence, enhancing safety and robustness.

Contribution

The paper presents a comprehensive evaluation of malicious models in multi-LLM systems and proposes external supervision methods to mitigate their impact.

Findings

01

Malicious models severely degrade system performance, especially in reasoning and safety tasks.

02

Mitigation strategies recover over 95% of performance loss caused by malicious models.

03

Making model collaboration systems fully resistant to malicious models remains an open challenge.

Abstract

Language models (LMs) are increasingly used in collaboration: multiple LMs trained by different parties collaborate through routing systems, multi-agent debate, model merging, and more. Critical safety risks remain in this decentralized paradigm: what if some of the models in multi-LLM systems are compromised or malicious? We first quantify the impact of malicious models by engineering four categories of malicious LMs, plugging them into four types of popular model collaboration systems, and evaluating the compromised systems across 10 datasets. We find that malicious models have a severe impact on multi-LLM systems, especially in the reasoning and safety domains, where performance is lowered by 7.12% and 7.94% on average, respectively. We then propose mitigation strategies to alleviate the impact of malicious components by employing external supervisors that oversee model collaboration to disable/mask…

Taxonomy

Topics: Topic Modeling · Advanced Graph Neural Networks · Artificial Intelligence in Healthcare and Education