Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
Jinqiang Long, Yanqi Dai, Guoxing Yang, Hongpeng Lin, Nanyi Fei, Yizhao Gao, Zhiwu Lu

TL;DR
Awaker2.5-VL introduces a sparse Mixture of Experts architecture with low-rank adaptations to improve multi-task performance in Multimodal Large Language Models, effectively addressing multi-task conflicts and enhancing training efficiency.
Contribution
The paper presents a novel MoE-based architecture with LoRA for scalable, multi-task MLLMs, mitigating multi-task conflicts and improving training and inference speed.
Findings
Effective multi-task learning across diverse tasks
Improved performance on multiple benchmarks
Faster training and inference with LoRA experts
Abstract
As the research of Multimodal Large Language Models (MLLMs) becomes popular, an advancing MLLM model is typically required to handle various textual and visual tasks (e.g., VQA, Detection, OCR, and ChartQA) simultaneously for real-world applications. However, due to the significant differences in representation and distribution among data from various tasks, simply mixing data of all tasks together leads to the well-known "multi-task conflict" issue, resulting in performance degradation across various tasks. To address this issue, we propose Awaker2.5-VL, a Mixture of Experts (MoE) architecture suitable for MLLM, which acquires the multi-task capabilities through multiple sparsely activated experts. To speed up the training and inference of Awaker2.5-VL, each expert in our model is devised as a low-rank adaptation (LoRA) structure. Extensive experiments on multiple latest benchmarks…
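The abstract's core idea, sparsely activated experts that are each a LoRA structure, can be illustrated with a small sketch. This is not the authors' implementation: the class name `LoRAMoELinear`, the softmax top-k router, the zero initialisation of `B`, and all sizes are assumptions chosen for illustration, and the code uses only the Python standard library. It shows a frozen base weight plus per-expert low-rank updates, where only the routed experts contribute to the output.

```python
import math
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small Gaussian-initialised matrix as nested lists."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def expert_params(d_in, d_out, rank):
    """Parameters in one LoRA expert: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


class LoRAMoELinear:
    """Hypothetical layer: frozen base weight W plus sparsely routed LoRA experts."""

    def __init__(self, d_in, d_out, num_experts=4, rank=2, top_k=1, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)           # frozen base projection
        self.gate = rand_matrix(num_experts, d_in, rng)  # router (assumed linear + softmax)
        self.A = [rand_matrix(rank, d_in, rng) for _ in range(num_experts)]
        # B starts at zero (common LoRA initialisation), so the layer
        # initially reproduces the base projection exactly.
        self.B = [[[0.0] * rank for _ in range(d_out)] for _ in range(num_experts)]
        self.top_k = top_k

    def route(self, x):
        """Return (expert_index, weight) pairs for the top_k experts."""
        probs = softmax(matvec(self.gate, x))
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = order[: self.top_k]
        norm = sum(probs[i] for i in chosen)
        return [(i, probs[i] / norm) for i in chosen]

    def forward(self, x):
        y = matvec(self.W, x)
        # Only the routed experts are evaluated: this is the sparse activation.
        for i, w in self.route(x):
            delta = matvec(self.B[i], matvec(self.A[i], x))
            y = [yv + w * dv for yv, dv in zip(y, delta)]
        return y


layer = LoRAMoELinear(d_in=8, d_out=4, num_experts=4, rank=2, top_k=1)
x = [1.0] * 8
routes = layer.route(x)
y = layer.forward(x)
```

In this sketch each expert holds `rank * (d_in + d_out)` parameters instead of the `d_in * d_out` of a dense expert, which is one plausible reading of why LoRA experts would reduce training and inference cost; the paper's actual routing strategy and expert placement may differ.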
Taxonomy
Topics: Target Tracking and Data Fusion in Sensor Networks · Inertial Sensor and Navigation · Indoor and Outdoor Localization Technologies
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
