Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing

Jona te Lintelo; Lichao Wu; Stjepan Picek

arXiv:2602.08741·cs.CR·February 10, 2026

TL;DR

This paper introduces Large Language Lobotomy (L³), a training-free attack exploiting expert routing in MoE LLMs to compromise safety, revealing a trade-off between efficiency and safety robustness.

Contribution

L³ is a novel, architecture-agnostic method that silences safety-critical experts in MoE LLMs, significantly increasing attack success rates without retraining.

Findings

01

L³ increases attack success from 7.3% to 70.4%.

02

Silencing fewer than 20% of experts can bypass safety guardrails.

03

MoE safety behaviors are concentrated in a small set of experts.
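One simple way to picture this concentration is to compare how often each expert is selected on prompts the model refuses versus prompts it answers. The sketch below is only an illustration under stated assumptions: the frequency-difference score is a simplified stand-in and not the attribution procedure used in the paper, and the function names, the route format (a flat list of selected expert indices per prompt), and the toy data are all hypothetical.

```python
from collections import Counter


def expert_safety_scores(refused_routes, complied_routes, num_experts):
    """Score experts by selection-frequency difference (refused minus complied).

    Each route is a list of expert indices the router selected for one prompt.
    This is a simplified stand-in for attribution, not the paper's method.
    """
    def freq(routes):
        counts = Counter(e for route in routes for e in route)
        total = sum(counts.values()) or 1  # avoid division by zero on empty input
        return [counts.get(e, 0) / total for e in range(num_experts)]

    f_refused = freq(refused_routes)
    f_complied = freq(complied_routes)
    return [r - c for r, c in zip(f_refused, f_complied)]


def top_fraction(scores, fraction):
    """Return indices of the highest-scoring experts, capped at a fraction of all experts."""
    k = int(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]


# Toy routing traces (hypothetical): expert 0 fires mostly on refused prompts.
refused = [[0, 1], [0, 2], [0, 1]]
complied = [[2, 3], [3, 4], [2, 4]]

scores = expert_safety_scores(refused, complied, num_experts=5)
candidates = top_fraction(scores, fraction=0.2)
print(candidates)
```

A positive score means the expert is selected more often on refused prompts. In this toy trace a single expert dominates the ranking, which mirrors the claim that safety behavior sits in a small subset of experts.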

Abstract

The rapid adoption of Mixture-of-Experts (MoE) architectures marks a major shift in the deployment of Large Language Models (LLMs). MoE LLMs improve scaling efficiency by activating only a small subset of parameters per token, but their routing structure introduces new safety attack surfaces. We find that safety-critical behaviors in MoE LLMs (e.g., refusal) are concentrated in a small set of experts rather than being uniformly distributed. Building on this, we propose Large Language Lobotomy (L³), a training-free, architecture-agnostic attack that compromises safety alignment by exploiting expert routing dynamics. L³ learns routing patterns that correlate with refusal, attributes safety behavior to specific experts, and adaptively silences the most safety-relevant experts until harmful outputs are produced. We evaluate L³ on eight state-of-the-art open-source MoE LLMs and show…
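To make "silencing" concrete, the sketch below shows what removing an expert from top-k routing could look like for a single token. It is a minimal illustration under stated assumptions, not the paper's implementation: the function `route_top_k`, the choice to apply softmax over the selected experts only, and the toy logits are all hypothetical, and real MoE routers differ in how they normalize gate weights.

```python
import math


def route_top_k(logits, k, silenced=frozenset()):
    """Return {expert_index: gate_weight} for top-k routing of one token.

    Silenced experts get a logit of -inf, so they can never be selected and
    the router falls back to the next-best experts. Assumption: gate weights
    are a softmax over the selected experts only.
    """
    masked = [-math.inf if e in silenced else z for e, z in enumerate(logits)]
    ranked = sorted(range(len(masked)), key=lambda e: masked[e], reverse=True)
    chosen = [e for e in ranked[:k] if masked[e] != -math.inf]
    if not chosen:
        raise ValueError("all experts are silenced")

    # Numerically stable softmax over the selected experts.
    peak = max(masked[e] for e in chosen)
    exps = {e: math.exp(masked[e] - peak) for e in chosen}
    norm = sum(exps.values())
    return {e: v / norm for e, v in exps.items()}


# Toy router logits for four experts (hypothetical values).
logits = [2.0, 1.0, 0.5, -1.0]

baseline = route_top_k(logits, k=2)
after = route_top_k(logits, k=2, silenced={0})
print(sorted(baseline), sorted(after))
```

With expert 0 silenced, the token is routed to experts 1 and 2 instead of 0 and 1, and the gate weights are renormalized over the remaining selection. No parameters are changed, which is consistent with the attack being described as training-free.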

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Topic Modeling