Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs

Zekun Fei; Zihao Wang; Weijie Liu; Ruiqi He; Jianing Geng; Zheli Liu; XiaoFeng Wang

arXiv:2605.04446 · cs.CR · May 7, 2026

TL;DR

Misrouter is an attack framework that exploits the routing mechanisms of Mixture-of-Experts large language models through input perturbations alone, inducing unsafe behaviors without modifying the model.

Contribution

The paper introduces a novel input-only attack method that jointly manipulates routing and output generation on open-source surrogate MoE models, then transfers the resulting attacks to real-world API services.

Findings

1. Successfully identified weakly aligned experts for harmful content
2. Optimized adversarial inputs to steer routing toward unsafe experts
3. Demonstrated transferability of attacks to public API services

Abstract

Mixture-of-Experts (MoE) architectures have emerged as a leading paradigm for scaling large language models through sparse, routing-based computation. However, this design introduces a new attack surface: the routing mechanism that determines which experts process each input. Prior work shows that manipulating routing can bypass safety alignment, but existing attacks require model modification and thus apply only to locally deployed models. By contrast, real-world LLM services are remotely hosted and accessible only through input queries. This raises a fundamental question: can MoE routing be exploited through input-only attacks to induce stronger unsafe behaviors in real-world services? Our key insight is to optimize attacks in a white-box setting on open-source surrogate MoE models and transfer the resulting adversarial inputs to public API services within the same model family. This…
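To make "the routing mechanism that determines which experts process each input" concrete, the following is a minimal sketch of top-k expert routing. It assumes a softmax router whose top-k weights are renormalized, which is a common MoE design but not necessarily the one used by the models studied in the paper. The function names and the score values are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are renormalized so the selected experts' weights sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# Hypothetical router scores for one token over four experts.
baseline = [0.1, 2.0, -1.0, 1.5]
print(top_k_route(baseline, k=2))

# The scores depend on the input, so a change to the input that raises
# expert 2's score changes which experts process the token.
shifted = [0.1, 2.0, 2.2, 1.5]
print(top_k_route(shifted, k=2))
```

In this toy example the first call selects experts 1 and 3, while the second selects experts 2 and 1. The sketch only shows that expert selection is a function of input-dependent scores; it does not reproduce the paper's attack.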
