Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

Jinqiang Long; Yanqi Dai; Guoxing Yang; Hongpeng Lin; Nanyi Fei; Yizhao Gao; Zhiwu Lu

arXiv:2411.10669·cs.CV·November 19, 2024

TL;DR

Awaker2.5-VL introduces a sparse Mixture of Experts architecture with low-rank adaptations to improve multi-task performance in Multimodal Large Language Models, effectively addressing multi-task conflicts and enhancing training efficiency.

Contribution

The paper presents a novel MoE-based architecture with LoRA for scalable, multi-task MLLMs, mitigating multi-task conflicts and improving training and inference speed.

Findings

01 Effective multi-task learning across diverse tasks

02 Improved performance on multiple benchmarks

03 Faster training and inference with LoRA experts

Abstract

As research on Multimodal Large Language Models (MLLMs) gains popularity, an advanced MLLM is typically required to handle various textual and visual tasks (e.g., VQA, Detection, OCR, and ChartQA) simultaneously for real-world applications. However, due to the significant differences in representation and distribution among data from various tasks, simply mixing the data of all tasks together leads to the well-known "multi-task conflict" issue, resulting in performance degradation across various tasks. To address this issue, we propose Awaker2.5-VL, a Mixture of Experts (MoE) architecture suitable for MLLMs, which acquires multi-task capabilities through multiple sparsely activated experts. To speed up the training and inference of Awaker2.5-VL, each expert in our model is designed as a low-rank adaptation (LoRA) structure. Extensive experiments on multiple latest benchmarks…
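The core idea in the abstract, sparsely activated experts that are each a LoRA structure, can be illustrated with a small sketch. This is not the authors' implementation: the class names `LoRAExpert` and `SparseLoRAMoE`, the softmax router, the top-k selection, the layer sizes, and the random initialization are all assumptions made for illustration, and the abstract does not specify how Awaker2.5-VL routes inputs or where the experts are attached.

```python
import math
import random


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    top = max(z)
    e = [math.exp(s - top) for s in z]
    total = sum(e)
    return [x / total for x in e]


def rand_matrix(rng, rows, cols, scale=0.1):
    """Small random matrix; used here only so the toy example has non-zero output."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LoRAExpert:
    """Low-rank update delta(x) = (alpha / r) * B(Ax), with A: r x d_in, B: d_out x r."""

    def __init__(self, rng, d_in, d_out, rank, alpha=1.0):
        self.a = rand_matrix(rng, rank, d_in)
        # Standard LoRA initializes B to zeros; random values are an
        # assumption here so that the experts visibly change the output.
        self.b = rand_matrix(rng, d_out, rank)
        self.scale = alpha / rank

    def __call__(self, x):
        return [self.scale * y for y in matvec(self.b, matvec(self.a, x))]


class SparseLoRAMoE:
    """Frozen base layer plus top-k routed LoRA experts (hypothetical layout)."""

    def __init__(self, d_in, d_out, num_experts=4, rank=2, top_k=1, seed=0):
        rng = random.Random(seed)
        self.base = rand_matrix(rng, d_out, d_in)        # frozen weight W
        self.gate = rand_matrix(rng, num_experts, d_in)  # router weights (assumed)
        self.experts = [LoRAExpert(rng, d_in, d_out, rank) for _ in range(num_experts)]
        self.top_k = top_k

    def route(self, x):
        """Return {expert_index: weight} for the top-k experts, weights summing to 1."""
        probs = softmax(matvec(self.gate, x))
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = order[: self.top_k]
        norm = sum(probs[i] for i in chosen)
        return {i: probs[i] / norm for i in chosen}

    def forward(self, x):
        """Compute W x plus the weighted low-rank updates of the selected experts only."""
        out = matvec(self.base, x)
        for i, w in self.route(x).items():
            delta = self.experts[i](x)
            out = [o + w * d for o, d in zip(out, delta)]
        return out


layer = SparseLoRAMoE(d_in=8, d_out=4, num_experts=4, rank=2, top_k=1)
x = [0.5] * 8
weights = layer.route(x)
y = layer.forward(x)
```

In this toy layout each expert stores `rank * (d_in + d_out)` values instead of a full `d_in * d_out` matrix, and only the selected experts are evaluated per input, which is the general reason low-rank experts are cheaper than dense ones.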

Code & Models

Repositories

metabrainagi/awaker
jax · Official

Taxonomy

Topics: Target Tracking and Data Fusion in Sensor Networks · Inertial Sensor and Navigation · Indoor and Outdoor Localization Technologies

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings