Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters
Zhiyu Xu, Lean Wang, Yuanxin Liu, Lei Li, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun

TL;DR
This paper systematically analyzes cross-modal skill injection in vision-language models, examining the scenarios in which it works, the merging methods used, and their hyperparameters, to clarify when and how this transfer technique is effective.
Contribution
It provides a comprehensive evaluation of cross-modal skill injection, identifying effective scenarios, comparing merging methods, and analyzing hyperparameter impacts.
Findings
Cross-modal skill injection performs well in instruction-following and cross-lingual tasks.
Classic merging methods such as Task Arithmetic (TA) and DARE outperform alternatives.
Hyperparameter tuning is critical for the success of merging methods.
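
To make the two merging methods named in the findings concrete, here is a minimal sketch of how Task Arithmetic and DARE are commonly formulated. It is not the paper's implementation. It assumes the VLM's language backbone and the domain-expert LLM were both derived from the same base LLM, so their parameters share names and shapes. The function names (`task_vector`, `dare_sparsify`, `inject_skill`) and the parameters `scale` and `drop_rate` are hypothetical, chosen for illustration; they stand in for the kind of hyperparameters the findings describe as critical.

```python
import numpy as np

def task_vector(expert, base):
    """Task Arithmetic: the skill is the parameter delta (expert - base)."""
    return {name: expert[name] - base[name] for name in base}

def dare_sparsify(delta, drop_rate, rng):
    """DARE: randomly drop delta entries, then rescale the survivors.

    Each entry is zeroed with probability `drop_rate`; kept entries are
    divided by (1 - drop_rate) so the delta is unchanged in expectation.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    sparse = {}
    for name, d in delta.items():
        keep = rng.random(d.shape) >= drop_rate
        sparse[name] = d * keep / (1.0 - drop_rate)
    return sparse

def inject_skill(vlm_backbone, delta, scale):
    """Add the scaled task vector to the VLM's language-model weights."""
    return {name: vlm_backbone[name] + scale * delta[name]
            for name in vlm_backbone}

# Toy weights (assumed for illustration): one 2x2 parameter per model.
base = {"w": np.zeros((2, 2))}
expert = {"w": np.ones((2, 2))}         # base LLM after domain fine-tuning
vlm = {"w": np.full((2, 2), 2.0)}       # language backbone inside the VLM

rng = np.random.default_rng(0)
delta = task_vector(expert, base)
merged_ta = inject_skill(vlm, delta, scale=0.5)
merged_dare = inject_skill(vlm, dare_sparsify(delta, 0.5, rng), scale=0.5)
```

In this sketch, `scale` controls how strongly the expert's skill is added and `drop_rate` controls how sparse the injected delta is; sweeping such values on a held-out set is one plausible way to act on the finding that hyperparameter tuning matters.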
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable proficiency in general multi-modal understanding; yet they struggle to efficiently acquire continually evolving domain-specific skills. Conventional approaches to enhancing VLM capabilities, such as Supervised Fine-Tuning (SFT), require extensive dataset curation and substantial computational resources. Model merging has emerged as an efficient alternative that enables the transfer of domain-specific expertise from Large Language Models (LLMs) to VLMs without incurring additional training data requirements or significant computational overhead. Unlike conventional merging of homogeneous LLMs, which mainly aggregates existing capabilities, cross-modal skill injection aims to induce emergent cross-modal capabilities by integrating a domain-expert LLM into a VLM. However, existing research lacks a systematic analysis of the…
