PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs

Zijing Wang; Yongkang Liu; Mingyang Wang; Ercong Nie; Deyuan Chen; Zhengjie Zhao; Shi Feng; Daling Wang; Xiaocui Yang; Yifei Zhang; Hinrich Schütze

arXiv:2601.07645·cs.CL·January 13, 2026

TL;DR

This paper introduces a training-free, layer-wise merging method for multimodal large language models that improves visual grounding by selectively integrating base language model parameters, addressing reasoning degradation caused by fine-tuning.

Contribution

It proposes a novel plateau-guided model merging technique that enhances multimodal reasoning without additional training, based on layer-wise analysis of model behavior.

Findings

01. Effective across five MLLMs and nine benchmarks
02. Improves focus on task-relevant visual regions
03. Shifts attention from scattered to localized patterns

Abstract

Multimodal Large Language Models (MLLMs) rely on strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine-tuning paradoxically degrades this text reasoning capability, undermining multimodal performance. To address this, we propose a training-free framework that mitigates the degradation. Through layer-wise vision token masking, we reveal a common three-stage pattern in MLLMs: early-modal separation, mid-modal alignment, and late-modal degradation. By analyzing the behavior of MLLMs at each stage, we propose a plateau-guided model merging method that selectively injects base language model parameters into MLLMs. Experiments with five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention-based analysis further reveals that merging shifts attention from diffuse,…
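
The idea of selectively injecting base language model parameters into an MLLM can be illustrated with a toy sketch. This is not the paper's procedure: the abstract does not state the merging rule, how the plateau determines which layers are merged, or any coefficient. The function `merge_layer_range`, the linear interpolation with weight `alpha`, the `layers.<i>.` key pattern, and the layer range below are all assumptions made for illustration, and parameters are plain Python lists rather than real checkpoint tensors.

```python
import re
from typing import Dict, List

Params = Dict[str, List[float]]

# Assumed naming scheme for transformer layers; real checkpoints may differ.
LAYER_RE = re.compile(r"layers\.(\d+)\.")


def merge_layer_range(mllm: Params, base: Params,
                      start: int, end: int, alpha: float) -> Params:
    """Interpolate base-LM weights into MLLM weights for layers start..end.

    Hypothetical helper: linear interpolation is an assumed merging rule,
    and the layer range is supplied by the caller rather than derived
    from any plateau analysis.
    """
    merged: Params = {}
    for name, weights in mllm.items():
        match = LAYER_RE.search(name)
        in_range = match is not None and start <= int(match.group(1)) <= end
        if in_range and name in base:
            # Blend: alpha = 0 keeps the MLLM weights, alpha = 1 restores the base LM.
            merged[name] = [(1 - alpha) * w + alpha * b
                            for w, b in zip(weights, base[name])]
        else:
            # Outside the range (or vision-only parameters): keep MLLM weights.
            merged[name] = list(weights)
    return merged


# Toy parameters standing in for two checkpoints with shared layer names.
mllm = {"layers.0.w": [1.0, 1.0], "layers.5.w": [1.0, 1.0], "vision.proj": [2.0]}
base = {"layers.0.w": [0.0, 0.0], "layers.5.w": [0.0, 0.0]}

merged = merge_layer_range(mllm, base, start=4, end=7, alpha=0.5)
```

In this toy setup only `layers.5.w` falls inside the chosen range, so it is the only entry that moves toward the base model; parameters with no counterpart in the base model, such as `vision.proj`, are copied unchanged.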

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications