TL;DR
This paper introduces MMER, a training-free method to expand and retain multimodal capabilities in large language models by merging and decoupling parameters, achieving effective multimodal expansion with minimal performance loss.
Contribution
The paper proposes MMER, a novel training-free approach that merges and decouples LLM parameters to expand multimodal abilities while preserving original performance and reducing catastrophic forgetting.
Findings
MMER retains 99% of original LLM performance.
It significantly improves multimodal expansion capabilities.
It effectively mitigates catastrophic forgetting.
Abstract
Fine-tuning Large Language Models (LLMs) with multimodal encoders on modality-specific data expands the modalities that LLMs can handle, leading to the formation of Multimodal LLMs (MLLMs). However, this paradigm heavily relies on resource-intensive and inflexible fine-tuning from scratch with new multimodal data. In this paper, we propose MMER (Multi-modality Expansion and Retention), a training-free approach that integrates existing MLLMs for effective multimodal expansion while retaining their original performance. Specifically, MMER reuses MLLMs' multimodal encoders while merging their LLM parameters. By comparing original and merged LLM parameters, MMER generates binary masks to approximately separate LLM parameters for each modality. These decoupled parameters can independently process modality-specific inputs, reducing parameter conflicts and preserving original MLLMs' fidelity.…
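The merge-then-decouple idea described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the merging rule or the exact mask criterion, so the sketch swaps in a plain task-arithmetic sum for the merge and a simple magnitude comparison for the mask. All function names, the `lam` threshold, and the toy weights are hypothetical.

```python
import numpy as np

def merge_task_vectors(task_vectors):
    """Assumed merge rule: plain sum of task vectors (a stand-in, not the paper's rule)."""
    return np.sum(task_vectors, axis=0)

def modality_mask(tau_i, tau_merged, lam=1.0):
    """Assumed mask rule: keep an entry when this modality's own change is at
    least lam times the change contributed by everything else in the merge."""
    return np.abs(tau_i) >= lam * np.abs(tau_merged - tau_i)

def decoupled_params(base, tau_merged, mask):
    """Apply only the masked part of the merged task vector to the base LLM."""
    return base + mask * tau_merged

# Toy base LLM weights and two task vectors (fine-tuned weights minus base weights).
base = np.array([0.5, -0.25, 0.125, 0.25, -0.5, 0.75])
tau_vision = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
tau_audio = np.array([0.0, 0.0, 0.0, 0.0, 2.0, 2.0])

tau_merged = merge_task_vectors([tau_vision, tau_audio])
mask_vision = modality_mask(tau_vision, tau_merged)
mask_audio = modality_mask(tau_audio, tau_merged)

# Each modality's inputs would be routed through its own decoupled parameters.
vision_params = decoupled_params(base, tau_merged, mask_vision)
audio_params = decoupled_params(base, tau_merged, mask_audio)
```

In this toy case the two task vectors touch disjoint entries, so each mask recovers the corresponding fine-tuned weights exactly. With overlapping task vectors the recovery is only approximate, which matches the abstract's wording that the masks "approximately separate" the parameters. Only the merged task vector and one binary mask per modality need to be stored, rather than a full copy of each fine-tuned LLM.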
Peer Reviews
Decision: Submitted to ICLR 2025
1. The paper offers several clear illustrations that are very helpful for understanding the mechanisms of parameter merging, expansion, and retention.
2. By decoupling the merged task vector with masks, the model can use parameters that approximate those of a specific modality to process inputs of that modality, which seems reasonable.
3. The method merges a newly fine-tuned MLLM as an additional task vector while maintaining performance on the original task.
For the audio modality, the merged model selects only 2.2% of the parameters from the merged task vector (Figure 4a), which indicates that, according to the merging mechanism, the majority of key parameters in the audio MLLM deviate from the direction of the original LLM. However, the merged model (98% of parameters from the original LLM plus 2% from the merged task vector, activated by the audio mask) achieves a 1.5x improvement over the original audio MLLMs, which is confusing. The authors attribute this result to the redu
- MMER’s training-free approach makes it practical and resource-efficient, avoiding the computational costs of extensive fine-tuning.
- The method effectively retains the original performance of merged models, mitigating catastrophic forgetting and preserving their capabilities.
- MMER demonstrates versatility, applicable to various modalities, and maintains performance across diverse multimodal tasks.
- The paper includes extensive experiments that show consistent performance improvements over
- The paper could better explain the advantages of MMER over existing modular approaches and provide a clearer justification for adopting this monolithic method.
- While the paper demonstrates effectiveness, it lacks comparisons with certain mainstream MLLMs and does not evaluate larger-scale models. Including these aspects would strengthen the argument for MMER’s superiority and provide insights into its performance at scale.
- The paper proposes MMER, a training-free approach that enables seamless multimodal expansion for LLMs through multimodal parameter merging and decoupling.
- The paper also leverages MMER to mitigate catastrophic forgetting in MLLMs, demonstrating its potential for continual learning applications.
- Extensive experiments demonstrate the effectiveness of MMER on multi-modality expansion, multi-modality retention, and mitigating catastrophic forgetting.
- Although MMER performs well at mitigating catastrophic forgetting, given the amount of storage it requires, it is unclear how this improvement compares to simply tuning task-specific adapters for new tasks.
- The paper mentions that the hyper-parameters, such as $\alpha$ and $\lambda$, are selected based on the validation set. However, the construction of this validation set is not well-detailed, making it hard to tell how well these hyper-parameters can