Securing the Floor and Raising the Ceiling: A Merging-based Paradigm for Multi-modal Search Agents
Zhixiang Wang, Jingxuan Xu, Dajun Chen, Yunfang Wu, Wei Jiang, Yong Li

TL;DR
This paper introduces a training-free method that gives vision-language models multi-modal search capabilities through cross-modal model merging. It uses Optimal Brain Merging (OBM), a saliency-aware merging algorithm, to mitigate parameter interference without any additional multi-modal training data.
Contribution
It presents a training-free paradigm for multi-modal search that fuses a text-based search agent with a base VLM, and introduces Optimal Brain Merging (OBM), a saliency-aware merging algorithm that improves performance and convergence.
Findings
Model merging provides a solid zero-shot performance baseline.
OBM outperforms standard merging methods in search accuracy.
The approach achieves faster convergence and higher peak accuracy.
Abstract
Recent advances in Vision-Language Models (VLMs) have motivated the development of multi-modal search agents that can actively invoke external search tools and integrate retrieved evidence through multi-step reasoning. While promising, existing approaches typically rely on large-scale supervised trajectories or expensive reinforcement learning (RL), leading to high training cost, instability, and a severe cold-start problem for standard VLMs. We propose a training-free paradigm to empower VLMs with autonomous search capabilities via cross-modal model merging. By fusing a text-based search agent with a base VLM, we show that multi-modal search capabilities can be effectively composed without any additional multi-modal training data. To mitigate parameter interference during cross-modal integration, we introduce Optimal Brain Merging (OBM), a saliency-aware merging algorithm that…
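The merging idea in the abstract can be illustrated with a minimal sketch. The abstract is truncated before it describes OBM, so the code below does not implement OBM. It shows plain task-vector merging, with an optional magnitude-based mask as a crude stand-in for a saliency criterion. It assumes the text-based search agent was fine-tuned from the same base LLM that serves as the VLM's language backbone. All function names, parameter names, and the `alpha` and `keep_ratio` settings are hypothetical.

```python
from typing import Dict, List

# Toy parameter store: name -> flat list of weights (hypothetical layout).
Params = Dict[str, List[float]]


def task_vector(agent: Params, base: Params) -> Params:
    """Difference between the fine-tuned text agent and its base LLM."""
    return {n: [a - b for a, b in zip(agent[n], base[n])] for n in base}


def keep_top_magnitude(delta: Params, keep_ratio: float) -> Params:
    """Zero all but the largest-magnitude entries.

    This is a simple magnitude proxy for saliency, NOT the OBM criterion.
    Ties at the threshold are all kept.
    """
    flat = sorted((abs(v) for vals in delta.values() for v in vals), reverse=True)
    k = max(1, int(len(flat) * keep_ratio))
    threshold = flat[k - 1]
    return {
        n: [v if abs(v) >= threshold else 0.0 for v in vals]
        for n, vals in delta.items()
    }


def merge(vlm: Params, base: Params, agent: Params,
          alpha: float = 1.0, keep_ratio: float = 1.0) -> Params:
    """Add the (optionally masked) agent task vector to the VLM's language weights."""
    delta = keep_top_magnitude(task_vector(agent, base), keep_ratio)
    merged: Params = {}
    for name, vals in vlm.items():
        if name in delta:
            merged[name] = [v + alpha * d for v, d in zip(vals, delta[name])]
        else:
            # Parameters with no text-side counterpart (e.g. a vision encoder)
            # are copied unchanged.
            merged[name] = list(vals)
    return merged


base = {"lm.w": [1.0, 2.0, 3.0, 4.0]}
agent = {"lm.w": [1.5, 2.0, 3.0, 2.0]}
vlm = {"lm.w": [1.0, 2.5, 3.0, 4.0], "vision.w": [9.0]}

full = merge(vlm, base, agent)
sparse = merge(vlm, base, agent, keep_ratio=0.25)
```

In this toy setup, `full` applies every agent update to the language weights, while `sparse` keeps only the single largest update. No gradient steps or multi-modal data are involved, which is the sense in which such merging is training-free.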
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques
