RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

Zhiyuan Xu; Joseph Gardiner; Sana Belguith; Lichao Wu

arXiv:2605.02946 · cs.LG · May 6, 2026

TL;DR

RouteHijack is a novel routing-aware attack that manipulates expert routing in Mixture-of-Experts LLMs to bypass safety measures, revealing a fundamental vulnerability in these architectures.

Contribution

The paper introduces RouteHijack, the first routing-aware jailbreak targeting MoE LLMs, significantly improving attack success rates and demonstrating transferability across models.

Findings

01. RouteHijack achieves a 69.3% average attack success rate (ASR) across seven MoE LLMs.

02. The attack transfers zero-shot to five sibling MoE variants, raising ASR from 27.7% to 61.2%.

03. It generalizes to MoE-based VLMs, increasing ASR from 2.47% to 38.7%.

Abstract

Safety alignment is critical for the responsible deployment of large language models (LLMs). As Mixture-of-Experts (MoE) architectures are increasingly adopted to scale model capacity, understanding their safety robustness becomes essential. Existing adversarial attacks, however, have notable limitations: prompt-based jailbreaks rely on heuristic search and transfer poorly; model intervention methods require privileged access to internal representations; and optimization-based input attacks remain output-centric and are fundamentally limited on MoE models due to the non-differentiable routing mechanism. In this paper, we present RouteHijack, a routing-aware jailbreak for MoE LLMs. Our key insight is that safety behavior is concentrated in a small subset of experts, creating an opportunity to steer model behavior by influencing routing decisions through input optimization. Building on…
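The abstract's remark about non-differentiable routing can be illustrated with a toy router. The sketch below is not RouteHijack and is not taken from the paper. It assumes a standard softmax top-k gate, which the abstract does not specify, and the function names, logits, and k=2 are all hypothetical. It shows only that expert selection is a discrete choice: a small change to the router logits leaves the selected set unchanged, while a larger one swaps an expert outright.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Toy top-k gate (assumed design, not the paper's).

    Returns the sorted indices of the k highest-probability experts
    and their gate weights renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected = sorted(ranked[:k])
    mass = sum(probs[i] for i in selected)
    weights = {i: probs[i] / mass for i in selected}
    return selected, weights

# Hypothetical router logits for one token over four experts.
base_selected, base_weights = top_k_route([2.0, 1.0, 0.5, -1.0], k=2)

# A small nudge to expert 2 does not change which experts are selected,
# so the selection itself carries no gradient signal here.
nudged_selected, _ = top_k_route([2.0, 1.0, 0.6, -1.0], k=2)

# A larger change crosses the boundary and expert 2 replaces expert 1.
flipped_selected, _ = top_k_route([2.0, 1.0, 1.2, -1.0], k=2)

print(base_selected, nudged_selected, flipped_selected)
```

Because the selected set changes only in jumps, a loss defined purely on the model's output receives no smooth signal about which experts are chosen. This is one way to read the limitation the abstract attributes to output-centric optimization on MoE models.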
