Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE
Zeren Chen, Ziqin Wang, Zhen Wang, Huayang Liu, Zhenfei Yin, Si Liu, Lu Sheng, Wanli Ouyang, Yu Qiao, Jing Shao

TL;DR
This paper introduces Octavius, a novel framework combining LoRA and MoE techniques to mitigate task interference in Multimodal Large Language Models, significantly improving performance across diverse multimodal tasks.
Contribution
It pioneers the integration of MoE into MLLMs with LoRA to address task interference, demonstrating notable performance gains.
Findings
Approximately 20% performance improvement on downstream tasks
Effective mitigation of task interference in multimodal learning
Versatile application across 2D and 3D tasks
Abstract
Recent studies have demonstrated Large Language Models (LLMs) can extend their zero-shot generalization capabilities to multimodal learning through instruction tuning. As more modalities and downstream tasks are introduced, negative conflicts and interference may have a worse impact on performance. While this phenomenon has been overlooked in previous work, we propose a novel and extensible framework, called Octavius, for comprehensive studies and experimentation on multimodal learning with Multimodal Large Language Models (MLLMs). Specifically, we combine the well-known Mixture-of-Experts (MoE) and one of the representative PEFT techniques, i.e., LoRA, designing a novel LLM-based decoder, called LoRA-MoE, for multimodal learning. To the best of our knowledge, we are one of the pioneering efforts to introduce MoE into MLLMs to address this problem. The experimental results (about 20%…
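The LoRA-MoE idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single linear layer with a frozen base weight, several LoRA experts, and a gate that scores experts once per sample from a sample-level embedding and keeps the top-k. The names `LoRAMoELinear`, `top_k_gate`, and `sample_embedding`, the softmax over the selected logits, and the scaling factor `1 / rank` are all illustrative assumptions, not details taken from the paper.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small Gaussian-initialized matrix."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def top_k_gate(logits, k):
    """Softmax over the k largest logits; all other experts get weight 0."""
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


class LoRAMoELinear:
    """Hypothetical layer: frozen W plus a gated mixture of LoRA experts."""

    def __init__(self, d_in, d_out, rank, num_experts, top_k, d_gate, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.scale = 1.0 / rank  # assumed LoRA scaling (alpha = 1)
        self.W = rand_matrix(d_out, d_in, rng)  # frozen base weight
        # Standard LoRA init: A is random, B is zero, so the update starts at 0.
        self.A = [rand_matrix(rank, d_in, rng) for _ in range(num_experts)]
        self.B = [[[0.0] * rank for _ in range(d_out)] for _ in range(num_experts)]
        self.gate = rand_matrix(num_experts, d_gate, rng)  # one score per expert

    def forward(self, x, sample_embedding):
        # Routing depends on the sample, not on the individual token x.
        weights = top_k_gate(matvec(self.gate, sample_embedding), self.top_k)
        out = matvec(self.W, x)
        for w, A, B in zip(weights, self.A, self.B):
            if w == 0.0:
                continue  # expert not selected for this sample
            delta = matvec(B, matvec(A, x))  # low-rank update B(Ax)
            out = [o + w * self.scale * d for o, d in zip(out, delta)]
        return out, weights
```

In this sketch the same gate weights would be reused for every token of a sample, so experts are chosen per instruction rather than per token. Setting `top_k` equal to `num_experts` makes every expert active for every sample, which removes the routing decision.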
Peer Reviews
Decision·ICLR 2024 poster
This method is straightforward but powerful, capable of integrating numerous vision tasks within this framework. Its adaptability is showcased as it seamlessly operates with both 2D images and 3D point clouds. Ultimately, the integration of LoRA with MoE for PEFT proves to be highly efficient.
Given our awareness of each example's task, an important baseline involves employing a dedicated LoRA for each task individually. Additionally, conducting an ablation study on the impact of top-k would be informative.
- The idea of MoE with sample routing to mitigate task interference for MLLMs is novel.
- The author conducted thorough experiments to validate the framework on a diverse set of 2D and 3D tasks. The gains are substantial.
- The framework is modular and extensible to incorporate more modalities and tasks.
- There is no analysis of how the routing among experts actually works. It would be great if the authors could provide a qualitative study of the sample-based gating network's predictions in response to the input task, to show how the routing mechanism works. I wonder whether the gating network simply acts like a task classifier or not.
- The scaling behavior as more modalities and tasks are added is not studied. There may be limitations in very high multi-task settings.
- The paper is well-motivated -- MoEs have been shown to be useful for distributing the different types of knowledge that are required for multi-task learning, and VL instruction tuning is a good application of this insight.
- The experiments are performed on both 2D image and 3D point cloud tasks, both individually and with the two datasets combined.
- I am primarily concerned by the analysis in Figure 5 -- it seems that all the 2D tasks are using only two experts! This makes me skeptical about the utility of MoE at all. Could you run the ablation in Table 5 with 2 experts, so that these two experts get selected each time and there is no routing involved, but the model has the capacity to learn with 2 LoRAs at once instead of one (which is what it seems to be doing in Figure 5)?
- An additional analysis that is needed is how the gate routing is di
Taxonomy
Topics: Distributed systems and fault tolerance · Context-Aware Activity Recognition Systems
