SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification

Zhenglin Lai; Mengyao Liao; Bingzhe Wu; Dong Xu; Zebin Zhao; Zhihang Yuan; Chao Fan; Jianqiang Li

arXiv:2506.17368 · cs.LG · October 14, 2025

TL;DR

This paper introduces SAFEx, a framework for identifying and mitigating safety vulnerabilities in MoE-based large language models by analyzing expert modules responsible for safety-critical behaviors.

Contribution

SAFEx provides a systematic method to identify, characterize, and intervene on safety-critical experts in MoE models, addressing a unique safety challenge not present in dense models.

Findings

01. Disabling selected experts reduces harmful response rates by 22%.

02. Expert-level interventions can improve safety without full-model retraining.

03. SAFEx reveals safety behavior is highly concentrated in specific experts.
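
The findings above refer to disabling experts and to expert-level interventions. One way such an intervention could look at inference time is to exclude selected experts from the router before top-k selection. The following is a minimal sketch under stated assumptions, not the paper's implementation: `route_top_k`, its parameters, and the logit values are hypothetical, and real MoE routers differ in details such as where the softmax is applied.

```python
import math

def route_top_k(logits, k, disabled=frozenset()):
    """Pick the top-k experts for one token, skipping disabled experts.

    logits   -- per-expert router scores (hypothetical values)
    k        -- number of experts to activate
    disabled -- expert indices that must never be selected
    Returns {expert_index: routing_weight}, with weights summing to 1.
    """
    candidates = [i for i in range(len(logits)) if i not in disabled]
    if k < 1 or not candidates:
        raise ValueError("need k >= 1 and at least one enabled expert")
    # Rank the remaining experts by router score and keep the best k.
    top = sorted(candidates, key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected experts only (shifted for numerical stability).
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

logits = [2.0, 1.0, 0.5, 3.0]
print(sorted(route_top_k(logits, k=2)))                # [0, 3]
print(sorted(route_top_k(logits, k=2, disabled={3})))  # [0, 1]
```

With expert 3 disabled, the token is rerouted to the next-best experts, which is the kind of ablation that lets one measure how much a behavior depends on a particular expert.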

Abstract

Large language models with Mixture-of-Experts (MoE) architectures achieve efficiency and scalability, yet their routing mechanisms introduce safety alignment challenges that techniques developed for dense models do not adequately address. In this work, the MoE-specific safety risk of positional vulnerability (the reliance of safety-aligned behaviors on specific expert modules) is formalized and systematically analyzed. An analytical framework, SAFEx, is presented to robustly identify, characterize, and validate safety-critical experts via a stability-based expert selection procedure, and to decompose them into two functional groups: the Harmful Content Detection Group (HCDG), which specializes in identifying and recognizing harmful content within user inputs, and the Harmful Response Control Group (HRCG), which specializes in controlling and enforcing model behaviors to generate appropriate safety…
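
The abstract mentions a stability-based expert selection procedure without spelling it out in this excerpt. As an illustration of the general idea only, the sketch below uses a generic bootstrap form of stability selection: resample the prompt set many times, rank experts by activation count in each resample, and keep the experts that land in the top ranks in most resamples. The function `stable_experts`, its parameters, the threshold of 0.8, and the toy data are all assumptions, not details taken from the paper.

```python
import random
from collections import Counter

def stable_experts(activations, top_k, n_rounds=100, threshold=0.8, seed=0):
    """Return experts that are consistently among the most activated.

    activations -- one list of activated expert ids per prompt (toy data here)
    top_k       -- how many top-ranked experts to record per resample
    n_rounds    -- number of bootstrap resamples
    threshold   -- minimum fraction of resamples in which an expert must rank in the top_k
    """
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(n_rounds):
        # Bootstrap: sample prompts with replacement, same size as the original set.
        sample = rng.choices(activations, k=len(activations))
        counts = Counter(e for prompt in sample for e in prompt)
        for expert, _count in counts.most_common(top_k):
            hits[expert] += 1
    return {e for e, h in hits.items() if h / n_rounds >= threshold}

# Toy data: experts 7 and 3 fire on every prompt, the others only occasionally.
acts = [[7, 3, 10 + (i % 10)] for i in range(20)]
print(stable_experts(acts, top_k=2) == {3, 7})  # True
```

Requiring agreement across resamples filters out experts whose high activation depends on a few particular prompts, which is the motivation for a stability criterion over a single ranking.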

Videos

SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification · SlidesLive

Taxonomy

Methods: Sparse Evolutionary Training · Mixture of Experts