OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging
Yongxian Wei, Runxi Cheng, Weike Jin, Enneng Yang, Li Shen, Lu Hou, Sinan Du, Chun Yuan, Xiaochun Cao, Dacheng Tao

TL;DR
This paper introduces a benchmark and new methods for merging multimodal large language models, demonstrating improved performance and modality complementarity without additional training data.
Contribution
It presents a comprehensive benchmark for MLLM merging, explores merging across modalities, and proposes a noise-removal and optimization technique for better model merging.
Findings
Model merging improves MLLM performance by 2.48% on average.
Multimodal merging outperforms individual modalities.
The proposed method improves robustness by removing noise from task vectors.
Abstract
Foundation models update slowly due to resource-intensive training, whereas domain-specific models evolve rapidly between releases. Model merging seeks to combine multiple expert models into a single, more capable model, reducing storage and serving costs while supporting decentralized development. Despite its potential, previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. Recently, Multimodal LLMs (MLLMs) that extend LLMs through large-scale multimodal training have gained traction. However, no existing benchmark for model merging research clearly divides the tasks for MLLM training and evaluation. In this paper, we introduce a model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, studying both LoRA and full fine-tuning…
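To illustrate what combining expert models into one can look like in its simplest form, the sketch below applies plain task arithmetic: each expert's task vector (its weights minus the shared base weights) is summed, scaled by a coefficient, and added back to the base. This is a generic baseline and not the OptMerge method itself; the function name `merge_task_arithmetic`, the coefficient `lam`, and the toy 2x2 weights are assumptions made for illustration.

```python
import numpy as np

def merge_task_arithmetic(base, experts, lam=1.0):
    """Merge experts into one model: base + lam * sum(expert - base).

    `base` and every entry of `experts` are dicts mapping parameter
    names to arrays of identical shape (hypothetical toy checkpoints).
    """
    merged = {}
    for name, base_w in base.items():
        # Task vector: what fine-tuning changed relative to the base.
        task_vectors = [expert[name] - base_w for expert in experts]
        merged[name] = base_w + lam * np.sum(task_vectors, axis=0)
    return merged

# Toy example: two "experts" that each changed a different weight.
base = {"w": np.zeros((2, 2))}
expert_a = {"w": np.array([[1.0, 0.0], [0.0, 0.0]])}
expert_b = {"w": np.array([[0.0, 0.0], [0.0, 2.0]])}

merged = merge_task_arithmetic(base, [expert_a, expert_b], lam=0.5)
print(merged["w"])
```

With `lam=0.5` and two experts, this reduces to averaging the experts' weights; other values of `lam` scale how strongly the combined changes are applied.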
Peer Reviews
Decision·ICLR 2026 Poster
1. Sufficient motivation for the problem: The value of model fusion in terms of storage, inference costs, and community collaboration is prominent.
2. Detailed benchmark construction: Adequate data volume, clear task decomposition, covering both LoRA and full parameter fine-tuning scenarios.
3. The paper is generally easy to read.
1. Modest algorithmic innovation: The core ideas (SVD low-rank truncation, SGD implicit regularization, mean initialization) are relatively common and represent engineering improvements.
2. The "no data" claim is somewhat overstated: λ is chosen by grid search on a validation set, and the choice of k also references test-set statistics.
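The SVD low-rank truncation named in the review above can be sketched as follows: a task vector stored as a matrix is decomposed, and only its top-k singular components are kept, on the assumption that the small trailing components are mostly noise. The paper's exact procedure is not given on this page, so this is a minimal illustration only; `truncate_task_vector`, the rank `k`, and the synthetic rank-1 signal plus noise are all assumptions.

```python
import numpy as np

def truncate_task_vector(delta, k):
    """Return the best rank-k approximation of `delta` via SVD."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Keep the k largest singular values and their singular vectors.
    return (u[:, :k] * s[:k]) @ vt[:k, :]

# Synthetic task vector: a rank-1 "signal" plus small dense noise.
rng = np.random.default_rng(0)
signal = np.outer(rng.normal(size=8), rng.normal(size=6))
noise = 0.01 * rng.normal(size=(8, 6))
delta = signal + noise

denoised = truncate_task_vector(delta, k=1)
print(np.linalg.matrix_rank(denoised, tol=1e-8))
```

The choice of `k` controls how aggressively components are discarded, which is why the review's remark about how k is selected matters for any claim of being data-free.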
### 1. Well-Aligned and Impactful Research Motivation
The paper addresses two long-standing, practical pain points in MLLM development that prior work has largely overlooked: (1) the fragmentation of domain-specialized MLLMs (e.g., VQA, OCR, geometry reasoning) in open-source communities, which incurs high storage/deployment costs and fails to leverage cross-task synergy; (2) the lack of standardized benchmarks for MLLM merging; existing frameworks focus on single-modal LLMs or vision classifie
## 1. The MLLM merging benchmark lacks coverage for practical scenarios
The benchmark’s design is restricted in two key ways that limit its utility for real-world merging:
- **Model scale gap**: Experiments only use small-to-medium parameter models (1B InternVL2.5, 7B Qwen2-VL/Vicuna), while practical deployments rely on large-scale MLLMs (70B+; e.g., Qwen2-VL-72B). Large models have unique traits (higher parameter redundancy, sparser gradients) that may break OptMerge’s current logic (e.g.,
1. Building a benchmark for model merging in Multimodal LLMs (MLLMs). The paper introduces the first model merging benchmark specifically designed for MLLMs. This is a good contribution, as the authors claim to be the first to evaluate the unification of both diverse task-specific capabilities (e.g., VQA, Geometry, Chart, OCR, and Grounding) and different modalities (vision, audio, and video).
2. The proposed method, OptMerge, is simple and effective. The OptMerge method is data-free and exceptio
1. Lack of discussion on model scale. This paper's experiments are based on InternVL2.5-1B-Instruct and Qwen2-VL-7B-Base. However, these belong to different model series. To observe how the method generalizes across model scale, it is more reasonable to conduct experiments within the same model series. For example, InternVL2.5 has 1B, 2B, and 8B model scales.
2. Lack of evaluation of the merged models on general multimodal QA tasks. This paper merges the checkpoints of 5 abilities, including VQA, Geometry, Chart, OCR,
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
