Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

Yukun Jiang; Hai Huang; Mingjie Li; Yage Zhang; Michael Backes; Yang Zhang

arXiv:2602.08621 · cs.LG · February 10, 2026

Open Access · 3 Reviews

TL;DR

This paper reveals that the safety of mixture-of-experts large language models can be compromised through unsafe routing configurations, and introduces methods to identify and mitigate these risks.

Contribution

The authors develop the Router Safety importance score (RoSais) and a fine-grained stochastic optimization framework (F-SOUR) to discover unsafe routes in MoE LLMs, highlighting inherent safety vulnerabilities.
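As a rough illustration of how a per-layer importance score could be used, the sketch below ranks routers by score and picks the highest-scoring ones as candidates for manipulation. It does not compute RoSais, whose definition is not given on this page. The function name `top_routers` and the score values are hypothetical, made up for illustration, and are not results from the paper.

```python
def top_routers(scores, m):
    """Return the layer indices of the m highest-scoring routers, highest first.

    scores: one importance score per layer (hypothetical values here).
    m: how many routers to select.
    """
    if not 0 <= m <= len(scores):
        raise ValueError("m must be between 0 and the number of layers")
    # Rank layer indices by descending score; sorted() is stable, so ties
    # keep their original layer order.
    ranked = sorted(range(len(scores)), key=lambda layer: scores[layer], reverse=True)
    return ranked[:m]


# Made-up scores for a 5-layer model, for illustration only.
hypothetical_scores = [0.02, 0.41, 0.05, 0.33, 0.01]
selected = top_routers(hypothetical_scores, m=2)
print(selected)
```

With these made-up scores, the routers in layers 1 and 3 are selected, which mirrors the idea that only a few routers would need to be touched.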

Findings

01. Manipulating high-RoSais routers can significantly increase attack success rates.

02. F-SOUR achieves high attack success rates of 0.90 and 0.98 on benchmark datasets.

03. Unsafe routing configurations pose inherent safety risks in MoE LLMs.
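For readers unfamiliar with the metric, attack success rate (ASR) is commonly computed as the fraction of harmful prompts for which the attack elicits a harmful response. The sketch below assumes that definition and takes the per-prompt verdicts as given; how the paper judges a response as harmful is not stated on this page, and the function name and sample verdicts are hypothetical.

```python
def attack_success_rate(judgements):
    """Fraction of prompts whose response was judged harmful.

    judgements: iterable of booleans, one per prompt (True = attack succeeded).
    """
    judgements = list(judgements)
    if not judgements:
        raise ValueError("need at least one judgement")
    return sum(1 for j in judgements if j) / len(judgements)


# Made-up verdicts for 10 prompts: 9 successes, 1 failure.
sample = [True] * 9 + [False]
asr = attack_success_rate(sample)
print(f"ASR = {asr:.2f}")
```

Under this definition, an ASR of 0.90 means that nine out of every ten attacked prompts produced a response judged harmful.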

Abstract

By introducing routers to selectively activate experts in Transformer layers, the mixture-of-experts (MoE) architecture significantly reduces computational costs in large language models (LLMs) while maintaining competitive performance, especially for models with massive parameters. However, prior work has largely focused on utility and efficiency, leaving the safety risks associated with this sparse architecture underexplored. In this work, we show that the safety of MoE LLMs is as sparse as their architecture by discovering unsafe routes: routing configurations that, once activated, convert safe outputs into harmful ones. Specifically, we first introduce the Router Safety importance score (RoSais) to quantify the safety criticality of each layer's router. Manipulation of only the high-RoSais router(s) can flip the default route into an unsafe one. For instance, on JailbreakBench,…
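To make the notion of a route concrete, the following minimal sketch shows generic top-k routing for a single token in a single layer, and how excluding an expert changes which experts are activated. It is not the authors' manipulation procedure. The `blocked` parameter is a hypothetical stand-in for a routing intervention, the logits are made up, and renormalizing the selected weights is one common convention that is assumed here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k, blocked=frozenset()):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    blocked: hypothetical set of expert indices excluded from selection,
             standing in for a manipulated route.
    """
    candidates = [i for i in range(len(router_logits)) if i not in blocked]
    if not 1 <= k <= len(candidates):
        raise ValueError("k must be between 1 and the number of allowed experts")
    probs = softmax(router_logits)
    chosen = sorted(candidates, key=lambda i: probs[i], reverse=True)[:k]
    # Assumed convention: renormalize so the selected weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


# Made-up router logits for 4 experts.
logits = [2.0, 0.5, 1.5, -1.0]
default_route = route_top_k(logits, k=2)
altered_route = route_top_k(logits, k=2, blocked={0})
print([i for i, _ in default_route])
print([i for i, _ in altered_route])
```

In this toy case the default route activates experts 0 and 2, while blocking expert 0 shifts the route to experts 2 and 1. The token is then processed by a different set of parameters, which is the lever the abstract describes.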

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The paper is well structured and easy to read. The theoretical background and the methods are well-explained, with enough details so that a reader without a solid background in MoE architectures or LLM safety can still understand the points made in the paper. The discovery that deliberately changing the routing configurations of MoE models can result in harmful behaviour is interesting. The formulation of sparse safety is insightful and original. The experiments are well organized and the resu

Weaknesses

**Unclear Threat Model and Limited Practical Motivation** The paper identifies a structural vulnerability in MoE architectures, but the real-world applicability of the attack scenario is underexplored. The threat model assumes that an adversary can manipulate routing configurations during inference. However, in practice, such access is typically restricted to the model owner or deployer. If an attacker can modify routing, they likely have control over other model components (weights, safety fi

Reviewer 02 · Rating 4 · Confidence 2

Strengths

1. This paper highlights the inherent sparse safety in the sparse Mixture-of-Experts (MoE) architecture for large language models (LLMs).
2. It introduces the Router Safety importance score (RoSais) to identify safety-critical routers within the model and demonstrates that manipulating a small number of these sparsely distributed routers can drastically increase the rate of unsafe outputs.
3. It also proposes F-SOUR, a fine-grained, token- and layer-wise optimization framework to discover conc

Weaknesses

1. A key limitation of this work is the lack of evaluation on the impact of routing manipulations on the model’s general utility. While the paper demonstrates significant increases in attack success rate (ASR) under RoSais-guided or F-SOUR-based routing interventions, it does not assess whether these modifications degrade performance on benign inputs or standard NLP tasks. Similarly, the defense strategy in Appendix D disables safety-critical experts without reporting any utility-preserving anal

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The proposed framework enables the discovery of unsafe routing paths with token-level precision, highlighting a sophisticated methodological design.
2. Through controlled experiments across multiple MoE LLM families, the paper convincingly demonstrates that manipulating only a few routers can dramatically increase attack success rates, providing strong empirical validation of the threat.

Weaknesses

1. Incremental contribution. The contribution appears limited and incremental, given concurrent efforts exploring similar safety issues in MoE LLMs. Previous works [1,2] have already shown that altering routers can induce harmful outputs. This paper primarily introduces a router scoring mechanism similar to a filtering approach, which may not represent a substantial conceptual advance.
2. Lack of comparison with recent baselines. The paper omits comparisons with concurrent works [1,2], both of

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Graph Neural Networks · Topic Modeling