Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing
Jona te Lintelo, Lichao Wu, Stjepan Picek

TL;DR
This paper introduces Large Language Lobotomy (L³), a training-free attack exploiting expert routing in MoE LLMs to compromise safety, revealing a trade-off between efficiency and safety robustness.
Contribution
L³ is a novel, architecture-agnostic method that silences safety-critical experts in MoE LLMs, significantly increasing attack success rates without retraining.
Findings
L³ raises the attack success rate from 7.3% to 70.4%.
Silencing fewer than 20% of experts can bypass safety guardrails.
MoE safety behaviors are concentrated in a small set of experts.
Abstract
The rapid adoption of Mixture-of-Experts (MoE) architectures marks a major shift in the deployment of Large Language Models (LLMs). MoE LLMs improve scaling efficiency by activating only a small subset of parameters per token, but their routing structure introduces new safety attack surfaces. We find that safety-critical behaviors in MoE LLMs (e.g., refusal) are concentrated in a small set of experts rather than being uniformly distributed. Building on this, we propose Large Language Lobotomy (L³), a training-free, architecture-agnostic attack that compromises safety alignment by exploiting expert routing dynamics. L³ learns routing patterns that correlate with refusal, attributes safety behavior to specific experts, and adaptively silences the most safety-relevant experts until harmful outputs are produced. We evaluate L³ on eight state-of-the-art open-source MoE LLMs and show…
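To make the routing terms in the abstract concrete, the toy sketch below shows what "activating only a small subset of parameters per token" and "silencing" an expert can look like at the level of a single router decision. It is an illustration under stated assumptions, not the paper's implementation: the function name `route_top_k`, the `silenced` parameter, the example logits, and the choice to apply softmax over only the selected experts are all hypothetical. It does not cover how L³ learns refusal-correlated routing patterns or attributes safety behavior to experts.

```python
import math

def route_top_k(logits, k, silenced=frozenset()):
    """Toy top-k router: return {expert_index: gate_weight}.

    Experts listed in `silenced` (a hypothetical parameter) are dropped
    before selection, so the router falls back to the next-best experts.
    """
    candidates = [(score, idx) for idx, score in enumerate(logits)
                  if idx not in silenced]
    if k > len(candidates):
        raise ValueError("not enough active experts for top-k routing")
    # Keep only the k highest-scoring experts for this token.
    top = sorted(candidates, reverse=True)[:k]
    # Softmax over the selected logits only (one common convention,
    # assumed here; real models may normalize differently).
    max_score = max(score for score, _ in top)
    exps = {idx: math.exp(score - max_score) for score, idx in top}
    total = sum(exps.values())
    return {idx: value / total for idx, value in exps.items()}

# Made-up router scores for one token over four experts.
logits = [2.0, 0.5, 1.0, -1.0]
baseline = route_top_k(logits, k=2)                    # experts 0 and 2 are active
after_silencing = route_top_k(logits, k=2, silenced={0})  # expert 1 replaces expert 0
```

In this toy setting only two of the four experts run for the token, and removing expert 0 from the candidate set shifts its share of the computation to the next-ranked expert. If a behavior such as refusal depended mostly on expert 0, it would no longer be exercised for this token, which is the concentration effect the abstract describes.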
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Topic Modeling
