Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
Zekun Fei, Zihao Wang, Weijie Liu, Ruiqi He, Jianing Geng, Zheli Liu, XiaoFeng Wang

TL;DR
Misrouter is an attack framework that exploits routing mechanisms in Mixture-of-Experts large language models through input perturbations, enabling unsafe behaviors without model modification.
Contribution
It introduces a novel input-only attack method that jointly manipulates routing and output generation in open-source surrogate MoE models and transfers these attacks to real-world API services.
Findings
Identifies experts that are weakly aligned against harmful content
Optimizes adversarial inputs to steer routing toward those unsafe experts
Demonstrates that the attacks transfer to public API services
Abstract
Mixture-of-Experts (MoE) architectures have emerged as a leading paradigm for scaling large language models through sparse, routing-based computation. However, this design introduces a new attack surface: the routing mechanism that determines which experts process each input. Prior work shows that manipulating routing can bypass safety alignment, but existing attacks require model modification and thus apply only to locally deployed models. By contrast, real-world LLM services are remotely hosted and accessible only through input queries. This raises a fundamental question: can MoE routing be exploited through input-only attacks to induce stronger unsafe behaviors in real-world services? Our key insight is to optimize attacks in a white-box setting on open-source surrogate MoE models and transfer the resulting adversarial inputs to public API services within the same model family. This…
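For readers unfamiliar with the routing step the abstract refers to, the following is a minimal sketch of generic top-k softmax gating. It is an illustration only: the function names, the example scores, and the choice of k = 2 are assumptions, and the routers in the models studied by the paper may differ in detail.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are renormalized over the selected experts so they sum to 1.
    (Hypothetical helper; not taken from the paper.)
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

# Made-up router scores for one token over four experts.
routes = top_k_route([2.0, 0.5, 1.0, -1.0], k=2)
for expert, weight in routes:
    print(f"expert {expert}: weight {weight:.3f}")
```

Because the router scores are computed from the token representation, which experts are selected depends on the input itself; this dependence is the attack surface the abstract describes.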
