RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
Zhiyuan Xu, Joseph Gardiner, Sana Belguith, Lichao Wu

TL;DR
RouteHijack is a novel routing-aware attack that manipulates expert routing in Mixture-of-Experts LLMs to bypass safety measures, revealing a fundamental vulnerability in these architectures.
Contribution
The paper introduces RouteHijack, the first routing-aware jailbreak targeting MoE LLMs, significantly improving attack success rates and demonstrating transferability across models.
Findings
RouteHijack achieves a 69.3% average attack success rate (ASR) across seven MoE LLMs.
The attack transfers zero-shot to five sibling MoE variants, raising ASR from 27.7% to 61.2%.
It generalizes to MoE-based VLMs, increasing ASR from 2.47% to 38.7%.
Abstract
Safety alignment is critical for the responsible deployment of large language models (LLMs). As Mixture-of-Experts (MoE) architectures are increasingly adopted to scale model capacity, understanding their safety robustness becomes essential. Existing adversarial attacks, however, have notable limitations. Prompt-based jailbreaks rely on heuristic search and transfer poorly, model intervention methods require privileged access to internal representations, and optimization-based input attacks remain output-centric and are fundamentally limited on MoE models because the routing mechanism is non-differentiable. In this paper, we present RouteHijack, a routing-aware jailbreak for MoE LLMs. Our key insight is that safety behavior is concentrated in a small subset of experts, creating an opportunity to steer model behavior by influencing routing decisions through input optimization. Building on…
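The routing idea in the abstract can be illustrated with a toy top-k router. The sketch below is not the paper's method (the abstract is cut off before the method is described). It only shows, under stated assumptions, why the expert selection is a discrete, non-differentiable step while the softmax routing probabilities vary smoothly with the input. The router weights `W`, the set `SAFETY_EXPERTS`, and the function names are all hypothetical and chosen for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def router_logits(x, W):
    # One logit per expert: dot product of the token vector with that expert's router row.
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

def top_k_experts(logits, k):
    # Discrete selection of the k largest logits (ties go to the lower index).
    # This argmax-style step is the non-differentiable part of routing.
    return sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]

def safety_mass(logits, safety_experts):
    # Smooth quantity: softmax probability assigned to the flagged experts.
    probs = softmax(logits)
    return sum(probs[i] for i in safety_experts)

# Hypothetical toy router: 4 experts, 2-dimensional token representation.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
SAFETY_EXPERTS = {0}  # assumption: expert 0 stands in for a "safety" expert

x_before = [1.0, 0.5]   # token representation that routes to expert 0
x_after = [-0.5, 1.0]   # perturbed representation that no longer does

logits_before = router_logits(x_before, W)
logits_after = router_logits(x_after, W)

selected_before = top_k_experts(logits_before, k=2)
selected_after = top_k_experts(logits_after, k=2)
mass_before = safety_mass(logits_before, SAFETY_EXPERTS)
mass_after = safety_mass(logits_after, SAFETY_EXPERTS)

print(selected_before, selected_after)
print(round(mass_before, 3), round(mass_after, 3))
```

In this toy setting, changing the input moves the flagged expert out of the selected top-2 set and lowers the probability mass the router assigns to it. Whether and how RouteHijack optimizes a quantity like this is not stated in the text above.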
