Sparse-Dense Mixture of Experts Adapter for Multi-Modal Tracking

Yabin Zhu; Jianqi Li; Chenglong Li; Jiaxiang Wang; Chengjie Gu; Jin Tang

arXiv:2603.13719 · cs.CV · March 17, 2026

Open Access

TL;DR

This paper introduces a novel Sparse-Dense Mixture of Experts Adapter framework for multi-modal tracking, improving multi-modal feature representation and high-order correlation modeling with efficient parameter usage.

Contribution

The paper proposes a unified PEFT framework with a sparse-dense MoE and a Gram-based hypergraph fusion for enhanced multi-modal tracking performance.

Findings

01. Achieves superior results on multiple multi-modal tracking benchmarks.
02. Effectively models modality-specific and shared information.
03. Enhances high-order multi-modal feature fusion.

Abstract

Parameter-efficient fine-tuning (PEFT) techniques, such as prompts and adapters, are widely used in multi-modal tracking because they alleviate issues of full-model fine-tuning, including time inefficiency, high resource consumption, parameter storage burden, and catastrophic forgetting. However, due to cross-modal heterogeneity, most existing PEFT-based methods struggle to effectively represent multi-modal features within a unified framework with shared parameters. To address this problem, we propose a novel Sparse-Dense Mixture of Experts Adapter (SDMoEA) framework for PEFT-based multi-modal tracking under a unified model structure. Specifically, we design an SDMoE module as the multi-modal adapter to model modality-specific and shared information efficiently. SDMoE consists of a sparse MoE and a dense-shared MoE: the former captures modality-specific information, while the latter…
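
To make the sparse/dense split described in the abstract concrete, here is a minimal sketch of an adapter that combines a top-k routed (sparse) group of experts with an always-active (dense) group. It is not the authors' implementation: the bottleneck expert design, the linear routers, the residual connection, and every name in it (`Expert`, `SparseDenseMoE`, `top_k`, the layer sizes) are assumptions made for illustration. The role of the dense group as carrying shared information is taken from the Findings, since the abstract text is cut off at that point.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class Expert:
    """Hypothetical bottleneck expert: down-project, ReLU, up-project."""

    def __init__(self, dim, hidden, rng):
        self.down = rand_matrix(hidden, dim, rng)
        self.up = rand_matrix(dim, hidden, rng)

    def __call__(self, x):
        h = [max(0.0, v) for v in matvec(self.down, x)]
        return matvec(self.up, h)


class SparseDenseMoE:
    """Illustrative adapter: sparse (top-k) experts plus dense (all-active) experts."""

    def __init__(self, dim, hidden, n_sparse, n_dense, top_k, seed=0):
        rng = random.Random(seed)
        self.sparse = [Expert(dim, hidden, rng) for _ in range(n_sparse)]
        self.dense = [Expert(dim, hidden, rng) for _ in range(n_dense)]
        self.sparse_router = rand_matrix(n_sparse, dim, rng)
        self.dense_router = rand_matrix(n_dense, dim, rng)
        self.top_k = top_k

    def sparse_gates(self, x):
        # Keep only the top-k router logits; renormalise them with a softmax.
        logits = matvec(self.sparse_router, x)
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = order[: self.top_k]
        weights = softmax([logits[i] for i in top])
        gates = [0.0] * len(logits)
        for i, w in zip(top, weights):
            gates[i] = w
        return gates

    def dense_gates(self, x):
        # Every dense expert receives a non-zero weight.
        return softmax(matvec(self.dense_router, x))

    def __call__(self, x):
        out = list(x)  # residual connection (an assumption of this sketch)
        for g, expert in zip(self.sparse_gates(x), self.sparse):
            if g > 0.0:  # unselected experts are never evaluated
                out = [o + g * v for o, v in zip(out, expert(x))]
        for g, expert in zip(self.dense_gates(x), self.dense):
            out = [o + g * v for o, v in zip(out, expert(x))]
        return out


layer = SparseDenseMoE(dim=8, hidden=2, n_sparse=4, n_dense=2, top_k=2)
token = [0.5, -1.0, 0.25, 0.0, 1.5, -0.5, 0.75, 1.0]
adapted = layer(token)
active = sum(1 for g in layer.sparse_gates(token) if g > 0.0)
print(len(adapted), active)
```

In this sketch only `top_k` of the sparse experts run for a given token, which is where the compute saving of sparse routing comes from, while the dense experts are always evaluated and weighted by a full softmax. How the paper actually routes tokens per modality, and how its dense-shared branch is defined, would need to be checked against the full text.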

Taxonomy

Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Speech and Audio Processing