Sparse-Dense Mixture of Experts Adapter for Multi-Modal Tracking
Yabin Zhu, Jianqi Li, Chenglong Li, Jiaxiang Wang, Chengjie Gu, Jin Tang

TL;DR
This paper introduces a Sparse-Dense Mixture of Experts Adapter (SDMoEA) framework for multi-modal tracking. It improves multi-modal feature representation and high-order correlation modeling while keeping the number of trainable parameters small.
Contribution
The paper proposes a unified parameter-efficient fine-tuning (PEFT) framework that combines a sparse-dense MoE adapter with Gram-based hypergraph fusion to improve multi-modal tracking performance.
Findings
Achieves superior results on multiple multi-modal tracking benchmarks.
Effectively models modality-specific and shared information.
Enhances high-order multi-modal feature fusion.
Abstract
Parameter-efficient fine-tuning (PEFT) techniques, such as prompts and adapters, are widely used in multi-modal tracking because they alleviate issues of full-model fine-tuning, including time inefficiency, high resource consumption, parameter storage burden, and catastrophic forgetting. However, due to cross-modal heterogeneity, most existing PEFT-based methods struggle to effectively represent multi-modal features within a unified framework with shared parameters. To address this problem, we propose a novel Sparse-Dense Mixture of Experts Adapter (SDMoEA) framework for PEFT-based multi-modal tracking under a unified model structure. Specifically, we design an SDMoE module as the multi-modal adapter to model modality-specific and shared information efficiently. SDMoE consists of a sparse MoE and a dense-shared MoE: the former captures modality-specific information, while the latter…
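The two-branch adapter described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the bottleneck expert design, the top-k softmax router, the softmax gate over the dense experts, the residual connection, and every name and hyperparameter (`BottleneckExpert`, `SparseDenseMoEAdapter`, `hidden`, `n_sparse`, `n_dense`, `top_k`) are assumptions chosen for illustration. The dense branch is assumed here to carry the shared information mentioned earlier in the abstract.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class BottleneckExpert:
    """Hypothetical adapter expert: down-project, ReLU, up-project."""

    def __init__(self, dim, hidden, rng):
        self.w_down = rng.normal(0.0, 0.02, (dim, hidden))
        self.w_up = rng.normal(0.0, 0.02, (hidden, dim))

    def __call__(self, x):
        return np.maximum(x @ self.w_down, 0.0) @ self.w_up


class SparseDenseMoEAdapter:
    """Sketch of a sparse (top-k routed) branch plus a dense (always-on) branch."""

    def __init__(self, dim, hidden=8, n_sparse=4, n_dense=2, top_k=1, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        self.sparse = [BottleneckExpert(dim, hidden, rng) for _ in range(n_sparse)]
        self.dense = [BottleneckExpert(dim, hidden, rng) for _ in range(n_dense)]
        self.w_router = rng.normal(0.0, 0.02, (dim, n_sparse))
        self.w_dense_gate = rng.normal(0.0, 0.02, (dim, n_dense))

    def route(self, x):
        # Keep only the top-k router logits per token, then renormalize.
        logits = x @ self.w_router                      # (tokens, n_sparse)
        top_idx = np.argsort(logits, axis=-1)[:, -self.top_k:]
        mask = np.zeros_like(logits, dtype=bool)
        np.put_along_axis(mask, top_idx, True, axis=-1)
        return softmax(np.where(mask, logits, -np.inf))

    def __call__(self, x):
        # Sparse branch: each token is processed only by its selected experts.
        gates = self.route(x)
        sparse_out = np.zeros_like(x)
        for e, expert in enumerate(self.sparse):
            chosen = gates[:, e] > 0
            if chosen.any():
                sparse_out[chosen] += gates[chosen, e:e + 1] * expert(x[chosen])

        # Dense branch: every token passes through every shared expert.
        dense_gates = softmax(x @ self.w_dense_gate)    # (tokens, n_dense)
        dense_out = np.zeros_like(x)
        for e, expert in enumerate(self.dense):
            dense_out += dense_gates[:, e:e + 1] * expert(x)

        # Residual connection, as is common for adapters (an assumption here).
        return x + sparse_out + dense_out
```

In this sketch, tokens from different modalities (for example RGB and thermal) would pass through the same adapter instance: the router can send them to different sparse experts, while the dense experts are shared by all tokens.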
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Speech and Audio Processing
