TL;DR
MASCing is a novel framework that enables flexible, scenario-specific safety reconfiguration of Mixture-of-Experts models without retraining, using activation steering masks to control expert behavior.
Contribution
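Introduces MASCing, which the authors describe as the first framework to reconfigure MoE model behavior across safety scenarios without retraining. It applies steering masks to expert activations and uses an LSTM-based surrogate model to capture cross-layer routing dependencies.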
Findings
Improves jailbreak defense success rate from 52.5% to 83.9%.
Increases adult-content generation success rate from 52.6% to 82.0%.
Demonstrates negligible overhead across seven open-source MoE models.
Abstract
Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs) have significantly reduced inference costs through sparse activation. However, this sparse activation paradigm also introduces new safety challenges. Since only a subset of experts is engaged for each input, model behavior becomes coupled to routing decisions, yielding a difficult-to-control mechanism that can vary across safety-relevant scenarios. At the same time, adapting model behavior through full fine-tuning or retraining is costly, especially when developers need to rapidly configure the same model for different safety objectives. We present MASCing (MoE Activation Steering Configuration), the first framework that enables flexible reconfiguration of MoE behavior across diverse safety scenarios without retraining. MASCing uses an LSTM-based surrogate model to capture cross-layer routing dependencies and map…
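To make the coupling between routing and behavior concrete, here is a minimal sketch of top-k expert routing with an additive steering mask. It is an illustration under stated assumptions, not the paper's implementation: the function name `top_k_route`, the additive form of the mask, and all numeric values are hypothetical, and the abstract does not specify how MASCing builds or applies its masks.

```python
import math

def top_k_route(logits, k, mask=None):
    """Select the k highest-scoring experts and softmax-normalise their weights.

    `mask` is a hypothetical additive steering vector with one entry per expert;
    the additive form is an assumption made for this sketch.
    """
    if mask is not None:
        logits = [l + m for l, m in zip(logits, mask)]
    # Indices of the k largest (possibly steered) router logits.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected experts only, shifted by the max for stability.
    peak = max(logits[i] for i in chosen)
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# Illustrative router logits for one token over four experts.
router_logits = [2.0, 1.5, 0.3, -0.4]

# Without steering, the two highest-scoring experts are selected.
baseline = top_k_route(router_logits, k=2)

# A hypothetical mask that suppresses expert 0 and promotes expert 2.
steer = [-5.0, 0.0, 3.0, 0.0]
steered = top_k_route(router_logits, k=2, mask=steer)

print(sorted(baseline))  # experts selected without the mask
print(sorted(steered))   # experts selected with the mask
```

Because only the selected experts contribute to the output, shifting a few router logits changes which parameters process the token at all. That is why a mask over routing can reconfigure behavior without touching the model weights.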
