When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing

Libo Sun; Po-wei Harn; Peixiong He; Xiao Qin

arXiv:2605.15484 · cs.CV · May 18, 2026

TL;DR

This paper studies when sparse Mixture-of-Experts (MoE) routing improves vision classification accuracy. It finds that gains depend on compute leverage (the fraction of total FLOPs that is routed) and, at ImageNet scale, on multi-expert routing, based on multi-seed experiments on CIFAR-10/100, Tiny-ImageNet, and ImageNet-1K.

Contribution

The paper identifies a compute-leverage pattern and shows that multi-expert routing ($k \geq 2$) is additionally required at ImageNet scale for sparse MoE to help in vision classification.

Findings

01

Positive accuracy gaps require routing a substantial fraction $\rho$ of total FLOPs.

02

Multi-expert routing ($k \geq 2$) is additionally required at ImageNet scale.

03

Softmax-based per-sample dispatch can rescue performance in certain settings.
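To make the first finding concrete, the routed fraction of FLOPs can be estimated with a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's accounting: the function names `routed_flop_fraction` and `mlp_expert_flops`, the two-layer MLP expert shape, the equal cost per expert, and the FLOP counts in the usage loop are all hypothetical choices made for illustration.

```python
def mlp_expert_flops(d_in: int, d_hidden: int) -> int:
    """FLOPs for one sample through a two-layer MLP expert (assumed shape).

    Counts 2 FLOPs per multiply-add for d_in -> d_hidden -> d_in and
    ignores biases and activations.
    """
    return 2 * (d_in * d_hidden + d_hidden * d_in)


def routed_flop_fraction(dense_flops: float, expert_flops: float, k: int) -> float:
    """Fraction of per-sample forward FLOPs spent inside routed experts.

    Assumption: each sample activates k experts of equal cost, and all
    other compute (stem, dense blocks, classifier head) is unrouted.
    """
    if dense_flops < 0 or expert_flops < 0 or k < 0:
        raise ValueError("FLOP counts and k must be non-negative")
    routed = k * expert_flops
    total = dense_flops + routed
    if total == 0:
        raise ValueError("total FLOPs must be positive")
    return routed / total


if __name__ == "__main__":
    # Hypothetical dense backbone cost; not a figure from the paper.
    dense = 1_000_000
    for hidden in (64, 256, 1024):
        rho = routed_flop_fraction(dense, mlp_expert_flops(64, hidden), k=2)
        print(f"hidden={hidden:5d}  rho={rho:.3f}")
```

Under these assumptions, widening the expert while holding the dense backbone fixed raises the routed fraction, which is the kind of lever a hidden-size sweep moves.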

Abstract

Mixture-of-Experts (MoE) networks promise favorable accuracy-compute trade-offs, yet practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. We study when sparse top-$k$ routing with hard capacity constraints helps in vision classification, evaluated under multi-seed protocols on four benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K). We observe a compute-leverage pattern: positive accuracy gaps require a substantial fraction $\rho$ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, as multi-expert routing ($k \geq 2$) is additionally required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 yields both predicted sign reversals across standard and depthwise backbones, ruling out backbone family as the active variable. An ImageNet-1K ablation that varies only…
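The routing scheme the abstract names, sparse top-$k$ routing with a hard capacity limit, can be illustrated with a short Python sketch. It is an illustration under assumptions rather than the authors' implementation: the function `route_top_k`, the first-come-first-served handling of full experts, and the renormalization of gate weights over the surviving experts are all hypothetical choices, and the paper's dispatch rule may differ.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one sample's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(
    logits: List[List[float]], k: int, capacity: int
) -> Tuple[List[List[Tuple[int, float]]], List[int]]:
    """Assign each sample to its top-k experts under a hard capacity limit.

    Returns (assignments, load), where assignments[i] is a list of
    (expert_index, gate_weight) pairs for sample i and load[e] counts the
    samples accepted by expert e. Samples are processed in order
    (assumption), so later samples lose their slot when an expert is full.
    """
    num_experts = len(logits[0])
    if not 1 <= k <= num_experts:
        raise ValueError("k must be between 1 and the number of experts")
    load = [0] * num_experts
    assignments: List[List[Tuple[int, float]]] = []
    for row in logits:
        probs = softmax(row)
        # Indices of the k highest-probability experts for this sample.
        top = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:k]
        kept = []
        for e in top:
            if load[e] < capacity:  # hard constraint: drop once the expert is full
                load[e] += 1
                kept.append((e, probs[e]))
        # Renormalize gate weights over the experts that accepted the sample.
        norm = sum(w for _, w in kept)
        assignments.append([(e, w / norm) for e, w in kept])
    return assignments, load


if __name__ == "__main__":
    # Three samples that all prefer experts 0 and 1 (made-up logits).
    batch = [[2.0, 1.0, 0.0]] * 3
    assignments, load = route_top_k(batch, k=2, capacity=2)
    print(assignments)
    print(load)
```

In this toy batch every sample prefers the same two experts, so the third sample finds both of them full and is dropped entirely, while the remaining expert receives no samples. That imbalance is a small-scale picture of the collapse behavior the abstract mentions.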

Code & Models

Repositories

libophd/sparse-moe-vision-rho (GitHub)
