TL;DR
This paper introduces SAFEx, a framework for identifying and mitigating safety vulnerabilities in MoE-based large language models by analyzing expert modules responsible for safety-critical behaviors.
Contribution
SAFEx provides a systematic method to identify, characterize, and intervene on safety-critical experts in MoE models, addressing a unique safety challenge not present in dense models.
Findings
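Disabling a small set of identified safety-critical experts lowers the model's refusal rate on harmful prompts by 22%, which shows how much safety behavior depends on those experts.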
Expert-level interventions can improve safety without full-model retraining.
SAFEx reveals safety behavior is highly concentrated in specific experts.
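As a rough illustration of what an expert-level intervention looks like mechanically, the sketch below masks chosen experts out of a top-k router so they can never be selected. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `route_top_k`, the softmax-over-selected-experts weighting, and the example logits are all hypothetical.

```python
import math


def route_top_k(router_logits, k, disabled=frozenset()):
    """Pick the top-k experts for one token, never selecting disabled ones.

    Hypothetical helper: returns {expert_index: gate_weight}, where the
    weights are a softmax over the selected experts' logits.
    """
    # Only experts that are not disabled are eligible for routing.
    candidates = [i for i in range(len(router_logits)) if i not in disabled]
    if not candidates:
        raise ValueError("all experts are disabled")

    # Keep the k eligible experts with the largest router logits.
    top = sorted(candidates, key=lambda i: router_logits[i], reverse=True)[:k]

    # Numerically stable softmax over the selected experts only.
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


logits = [2.0, 1.0, 0.5, -1.0]          # made-up router logits for 4 experts
baseline = route_top_k(logits, k=2)      # routes to experts 0 and 1
ablated = route_top_k(logits, k=2, disabled={0})  # expert 0 is masked out
```

With expert 0 disabled, the token is rerouted to the next-best experts, so the same comparison run over a prompt set would show how much a behavior depends on the masked experts.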
Abstract
Large language models with Mixture-of-Experts (MoE) architectures achieve efficiency and scalability, yet their routing mechanisms introduce safety alignment challenges insufficiently addressed by techniques developed for dense models. In this work, the MoE-specific safety risk of positional vulnerability (that safety-aligned behaviors rely on specific expert modules) is formalized and systematically analyzed. An analytical framework, SAFEx, is presented to robustly identify, characterize, and validate safety-critical experts via a stability-based expert selection procedure, and to decompose them into two functional groups: the Harmful Content Detection Group (HCDG), which specializes in identifying and recognizing harmful content within user inputs, and the Harmful Response Control Group (HRCG), which specializes in controlling and enforcing model behaviors to generate appropriate safety…
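The abstract does not spell out the stability-based expert selection procedure, so the following is only a generic stability-selection sketch, assuming the idea is to keep experts that rank highly across many random subsamples of prompts. Everything here is hypothetical: the function `stable_experts`, the per-prompt `scores` matrix (for example, how much more often an expert fires on harmful prompts than on benign ones), and the default parameters.

```python
import random


def stable_experts(scores, top_k, n_rounds=100, sample_frac=0.5,
                   threshold=0.8, seed=0):
    """Return (selected_experts, selection_frequency).

    scores[p][e] is a hypothetical per-prompt relevance score for expert e.
    An expert is selected if it lands in the top_k by mean score in at
    least `threshold` of the random prompt subsamples.
    """
    rng = random.Random(seed)
    n_prompts = len(scores)
    n_experts = len(scores[0])
    sample_size = max(1, int(sample_frac * n_prompts))
    counts = [0] * n_experts

    for _ in range(n_rounds):
        # Draw a random subset of prompts without replacement.
        subset = rng.sample(range(n_prompts), sample_size)
        # Mean score of each expert over this subset.
        means = [sum(scores[p][e] for p in subset) / sample_size
                 for e in range(n_experts)]
        # Record which experts make the top_k in this round.
        ranked = sorted(range(n_experts), key=lambda e: means[e], reverse=True)
        for e in ranked[:top_k]:
            counts[e] += 1

    freq = [c / n_rounds for c in counts]
    selected = [e for e in range(n_experts) if freq[e] >= threshold]
    return selected, freq


# Toy data: experts 0 and 3 score highest on every prompt.
toy_scores = [[1.0, 0.1 * (p % 3), 0.0, 0.9, 0.05] for p in range(6)]
selected, freq = stable_experts(toy_scores, top_k=2)
```

Filtering on selection frequency rather than a single ranking is meant to discard experts that score highly only because of a particular prompt sample.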
Taxonomy
Methods: Sparse Evolutionary Training · Mixture of Experts
