Mixpert: Mitigating Multimodal Learning Conflicts with Efficient Mixture-of-Vision-Experts

Xin He; Xumeng Han; Longhui Wei; Lingxi Xie; Qi Tian

arXiv:2505.24541 · cs.CV · June 2, 2025

TL;DR

Mixpert introduces an efficient mixture-of-vision-experts architecture with dynamic routing to improve multimodal learning by reducing domain conflicts and enhancing task-specific performance without significant computational overhead.

Contribution

The paper proposes Mixpert, a novel multi-expert vision model with dynamic routing that maintains joint learning benefits while enabling efficient multi-task fine-tuning.

Findings

01. Significant performance improvements across multiple visual tasks.

02. Reduced domain conflicts compared to single-encoder models.

03. Efficient multi-task learning with minimal additional computational cost.

Abstract

Multimodal large language models (MLLMs) require a nuanced interpretation of complex image information, typically leveraging a vision encoder to perceive various visual scenarios. However, relying solely on a single vision encoder to handle diverse task domains proves difficult and inevitably leads to conflicts. Recent work enhances data perception by directly integrating multiple domain-specific vision encoders, yet this structure adds complexity and limits the potential for joint optimization. In this paper, we introduce Mixpert, an efficient mixture-of-vision-experts architecture that inherits the joint learning advantages from a single vision encoder while being restructured into a multi-expert paradigm for task-specific fine-tuning across different visual tasks. Additionally, we design a dynamic routing mechanism that allocates input images to the most suitable visual expert.…
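
To make the routing idea concrete, here is a minimal, illustrative sketch of top-1 routing over several vision experts that sit on top of shared layers. It is not the authors' implementation: the abstract does not specify how the router scores experts or where the experts branch off, so the class `ToyMixtureOfVisionExperts`, the softmax-over-linear-scores router, the hard top-1 selection, the random weights, and all dimensions are assumptions chosen only for illustration.

```python
import math
import random


def linear(weights, x):
    """Apply a bias-free dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ToyMixtureOfVisionExperts:
    """Hypothetical top-1 routing over vision experts (illustration only)."""

    def __init__(self, feat_dim, num_experts, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

        # Assumption: shared layers stand in for the jointly trained encoder.
        self.shared = init(feat_dim, feat_dim)
        # Assumption: the router is a single linear layer, one score per expert.
        self.router = init(num_experts, feat_dim)
        # Assumption: each expert is a single task-specific linear layer.
        self.experts = [init(feat_dim, feat_dim) for _ in range(num_experts)]

    def route(self, image_feat):
        """Return the chosen expert index, routing probabilities, and shared features."""
        hidden = linear(self.shared, image_feat)
        probs = softmax(linear(self.router, hidden))
        expert_id = max(range(len(probs)), key=probs.__getitem__)
        return expert_id, probs, hidden

    def forward(self, image_feat):
        """Run only the selected expert, so per-image cost does not grow with expert count."""
        expert_id, _, hidden = self.route(image_feat)
        return expert_id, linear(self.experts[expert_id], hidden)


model = ToyMixtureOfVisionExperts(feat_dim=4, num_experts=3)
expert_id, visual_tokens = model.forward([0.2, -0.1, 0.7, 0.4])
print(expert_id, len(visual_tokens))
```

In this sketch only one expert runs per image, which is one plausible way to add experts with limited extra computation; the paper's actual mechanism may differ.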
