HI-MoE: Hierarchical Instance-Conditioned Mixture-of-Experts for Object Detection

Vadim Vashkelis; Natalia Trukhina

arXiv:2604.04908 · cs.LG · April 7, 2026

TL;DR

HI-MoE introduces a hierarchical routing architecture for object detection that improves efficiency and accuracy, especially on small objects, by aligning model computation with object-centric structure.

Contribution

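The paper proposes a two-stage hierarchical routing method for MoE in object detection: a scene router first selects a subset of experts, and an instance router then assigns each object query to a few experts within that subset. This matches the instance-centric nature of the task better than image-level or patch-level routing.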

Findings

1. HI-MoE outperforms a dense DINO baseline on COCO.

2. Significant gains are observed on small objects.

3. Preliminary analysis shows meaningful expert specialization.

Abstract

Mixture-of-Experts (MoE) architectures enable conditional computation by activating only a subset of model parameters for each input. Although sparse routing has been highly effective in language models and has also shown promise in vision, most vision MoE methods operate at the image or patch level. This granularity is poorly aligned with object detection, where the fundamental unit of reasoning is an object query corresponding to a candidate instance. We propose Hierarchical Instance-Conditioned Mixture-of-Experts (HI-MoE), a DETR-style detection architecture that performs routing in two stages: a lightweight scene router first selects a scene-consistent expert subset, and an instance router then assigns each object query to a small number of experts within that subset. This design aims to preserve sparse computation while better matching the heterogeneous, instance-centric structure…
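
The two-stage routing described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how either router is parameterized, so everything concrete here is an assumption made for illustration: linear routers (`scene_w`, `inst_w`), a pooled scene feature (`scene_feat`), top-k selection at both stages (`scene_k`, `inst_k`), and softmax gates over the selected experts. It is not the authors' implementation.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hierarchical_route(scene_feat, queries, scene_w, inst_w, scene_k, inst_k):
    """Two-stage routing sketch (assumed linear routers, not the paper's code).

    Stage 1: score every expert from one scene-level feature and keep scene_k.
    Stage 2: each object query picks inst_k experts from that subset only.
    Returns the expert subset and, per query, a list of (expert_id, gate).
    """
    scene_scores = [dot(w, scene_feat) for w in scene_w]
    subset = sorted(top_k(scene_scores, scene_k))

    assignments = []
    for q in queries:
        # Logits are computed only for experts that survived stage 1.
        logits = [dot(inst_w[e], q) for e in subset]
        picked = top_k(logits, inst_k)
        gates = softmax([logits[i] for i in picked])
        assignments.append([(subset[i], g) for i, g in zip(picked, gates)])
    return subset, assignments


# Toy example with hypothetical sizes: 8 experts, 4-dim features, 5 queries.
rng = random.Random(0)
NUM_EXPERTS, DIM, NUM_QUERIES = 8, 4, 5
scene_w = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
inst_w = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
scene_feat = [rng.gauss(0, 1) for _ in range(DIM)]
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

subset, assignments = hierarchical_route(
    scene_feat, queries, scene_w, inst_w, scene_k=4, inst_k=2
)
print("scene-level expert subset:", subset)
for qi, routed in enumerate(assignments):
    print(f"query {qi}:", [(e, round(g, 3)) for e, g in routed])
```

In this sketch every query in an image shares the same stage-1 subset, so per-query routing cost scales with `scene_k` rather than with the total number of experts.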
