Securing the Floor and Raising the Ceiling: A Merging-based Paradigm for Multi-modal Search Agents

Zhixiang Wang; Jingxuan Xu; Dajun Chen; Yunfang Wu; Wei Jiang; Yong Li

arXiv:2603.01416·cs.AI·March 3, 2026

Open Access

TL;DR

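This paper introduces a training-free method for giving vision-language models (VLMs) multi-modal search capabilities through cross-modal model merging. A text-based search agent is fused with a base VLM, and Optimal Brain Merging (OBM), a saliency-aware merging algorithm, mitigates parameter interference during the merge, so no additional multi-modal training data is required.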

Contribution

It presents a novel, training-free paradigm for multi-modal search built on cross-modal model merging, and introduces Optimal Brain Merging (OBM), a saliency-aware merging algorithm that mitigates parameter interference and improves both accuracy and convergence speed.

Findings

01. Model merging provides a solid zero-shot performance baseline.

02. OBM outperforms standard merging methods in search accuracy.

03. The approach achieves faster convergence and higher peak accuracy.

Abstract

Recent advances in Vision-Language Models (VLMs) have motivated the development of multi-modal search agents that can actively invoke external search tools and integrate retrieved evidence through multi-step reasoning. While promising, existing approaches typically rely on large-scale supervised trajectories or expensive reinforcement learning (RL), leading to high training cost, instability, and a severe cold-start problem for standard VLMs. We propose a training-free paradigm to empower VLMs with autonomous search capabilities via cross-modal model merging. By fusing a text-based search agent with a base VLM, we show that multi-modal search capabilities can be effectively composed without any additional multi-modal training data. To mitigate parameter interference during cross-modal integration, we introduce Optimal Brain Merging (OBM), a saliency-aware merging algorithm that…
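
The merging idea in the abstract can be illustrated with a small sketch. This is not the paper's OBM algorithm, whose details are not given in the text above. It assumes, hypothetically, that the VLM's language backbone and the text-based search agent share parameter names and shapes, computes the agent's "task vector" relative to its own text base model, and adds only the largest-magnitude entries to the VLM. Magnitude is used here purely as a stand-in for a saliency score, and all names (`task_vector`, `top_fraction_mask`, `merge_into_vlm`, `scale`, `keep`) are illustrative.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def task_vector(finetuned: Params, base: Params) -> Params:
    """Per-parameter difference between a fine-tuned model and its base."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def top_fraction_mask(values: List[float], keep: float) -> List[bool]:
    """Mark the `keep` fraction of entries with the largest magnitude.

    Magnitude is an assumed proxy for saliency, not the paper's criterion.
    """
    n_keep = max(1, round(keep * len(values)))
    ranked = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    kept = set(ranked[:n_keep])
    return [i in kept for i in range(len(values))]


def merge_into_vlm(vlm: Params, text_base: Params, text_agent: Params,
                   scale: float = 1.0, keep: float = 0.5) -> Params:
    """Add the masked text-agent task vector to the matching VLM parameters."""
    delta = task_vector(text_agent, text_base)
    merged: Params = {}
    for name, weights in vlm.items():
        if name not in delta:
            # Parameters with no text-side counterpart (e.g. a vision
            # encoder) are copied unchanged.
            merged[name] = list(weights)
            continue
        mask = top_fraction_mask(delta[name], keep)
        merged[name] = [w + scale * d if m else w
                        for w, d, m in zip(weights, delta[name], mask)]
    return merged


# Hypothetical toy weights, chosen to be exact in binary floating point.
text_base = {"llm.w": [1.0, 1.0, 1.0, 1.0]}
text_agent = {"llm.w": [1.5, 1.0, 0.875, 3.0]}
vlm = {"llm.w": [1.25, 0.75, 1.0, 1.0], "vision.w": [0.25, 0.5]}

merged = merge_into_vlm(vlm, text_base, text_agent, scale=1.0, keep=0.5)
print(merged["llm.w"])     # [1.75, 0.75, 1.0, 3.0]
print(merged["vision.w"])  # [0.25, 0.5]
```

In this toy run only the two largest shifts in the task vector (0.5 and 2.0) are transferred, while the small shift of -0.125 is dropped. That is the general intuition behind masking a task vector to limit interference. A real merge would operate on tensor checkpoints and would use whatever saliency measure the paper defines.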

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques