Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection

Yehao Lu; Minghe Weng; Zekang Xiao; Rui Jiang; Wei Su; Guangcong Zheng; Ping Lu; Xi Li

arXiv:2507.17436·cs.CV·July 24, 2025

TL;DR

This paper introduces Dynamic-DINO, a novel MoE-based fine-tuning framework for real-time open-vocabulary object detection, which dynamically activates experts during inference to improve performance with limited data.

Contribution

It proposes a dynamic MoE tuning method with a granularity decomposition mechanism and expert weight allocation, enhancing open-vocabulary detection performance.

Findings

01. Outperforms Grounding DINO 1.5 Edge pretrained on private data.

02. Efficient MoE tuning with only 1.56M open-source data.

03. Dynamic expert activation improves detection accuracy.

Abstract

The Mixture of Experts (MoE) architecture has excelled in Large Vision-Language Models (LVLMs), yet its potential in real-time open-vocabulary object detectors, which also leverage large-scale vision-language datasets but use smaller models, remains unexplored. This work investigates this domain and reveals intriguing insights. In the shallow layers, experts tend to cooperate with diverse peers to expand the search space. In the deeper layers, by contrast, fixed collaborative structures emerge: each expert maintains 2-3 fixed partners, and distinct expert combinations specialize in processing specific patterns. Concretely, we propose Dynamic-DINO, which extends Grounding DINO 1.5 Edge from a dense model to a dynamic inference framework via an efficient MoE-Tuning strategy. Additionally, we design a granularity decomposition mechanism to decompose the Feed-Forward Network (FFN) of base…
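
The abstract is cut off before it describes the decomposition in detail, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a two-layer ReLU FFN whose hidden dimension is sliced into equal chunks, one per expert, with every gate weight fixed to 1 and a plain linear router. All names (`split_into_experts`, `moe_ffn`, `router`) and all sizes are hypothetical. Under these assumptions, activating every expert reproduces the dense FFN output exactly, while top-k routing evaluates only a subset of the hidden units.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def dense_ffn(x, w1, b1, w2):
    # Dense FFN: w2 @ relu(w1 @ x + b1); the output bias is omitted for brevity.
    hidden = relu([a + b for a, b in zip(matvec(w1, x), b1)])
    return matvec(w2, hidden)

def split_into_experts(w1, b1, w2, num_experts):
    # Assumption: slice the hidden dimension into equal chunks, one per expert.
    hidden_size = len(w1)
    assert hidden_size % num_experts == 0
    size = hidden_size // num_experts
    experts = []
    for e in range(num_experts):
        lo, hi = e * size, (e + 1) * size
        experts.append((w1[lo:hi], b1[lo:hi], [row[lo:hi] for row in w2]))
    return experts

def moe_ffn(x, experts, router, top_k):
    # Score each expert with a linear router, keep the top_k, sum their outputs.
    # Assumption: gate weights are fixed to 1 (no softmax re-weighting).
    scores = matvec(router, x)
    order = sorted(range(len(experts)), key=lambda e: scores[e], reverse=True)
    chosen = order[:top_k]
    out = [0.0] * len(experts[0][2])
    for e in chosen:
        w1_e, b1_e, w2_e = experts[e]
        out = [o + y for o, y in zip(out, dense_ffn(x, w1_e, b1_e, w2_e))]
    return out, chosen

# Toy sizes and random weights, purely illustrative.
random.seed(0)
d_model, d_hidden, n_experts = 4, 8, 4
w1 = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
b1 = [random.uniform(-1, 1) for _ in range(d_hidden)]
w2 = [[random.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]
router = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(n_experts)]
x = [random.uniform(-1, 1) for _ in range(d_model)]

experts = split_into_experts(w1, b1, w2, n_experts)
dense_out = dense_ffn(x, w1, b1, w2)
full_out, _ = moe_ffn(x, experts, router, top_k=n_experts)   # all experts active
sparse_out, chosen = moe_ffn(x, experts, router, top_k=2)    # only two experts run
```

With `top_k=n_experts` the summed expert outputs match `dense_out` up to floating-point rounding, which is why slicing a pretrained FFN is a natural starting point for MoE tuning. With `top_k=2` only half of the hidden units are evaluated for this input.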

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems