Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

Shiqi Chen; Jinghan Zhang; Tongyao Zhu; Wei Liu; Siyang Gao; Miao Xiong; Manling Li; Junxian He

arXiv:2505.05464 · cs.CL · July 16, 2025

TL;DR

This paper uses model merging to combine the visual perception of a VLM with the reasoning ability of an LLM, transferring reasoning to the VLM without any training and offering insight into how perception and reasoning are represented inside the model.

Contribution

The paper proposes merging models across modalities to transfer reasoning abilities without additional training, and it analyzes how perception and reasoning are represented internally.

Findings

01

Merging models transfers reasoning capabilities in a training-free manner.

02

Perception is mainly encoded in early layers, reasoning in middle-to-late layers.

03

After merging, all layers contribute to reasoning, while the distribution of perception across layers remains stable.

Abstract

Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities are combined and contribute to model behavior remain poorly understood. In this work, we explore composing perception and reasoning through model merging, which connects the parameters of different models. Unlike previous work, which often focuses on merging models of the same kind, we propose merging models across modalities, enabling the reasoning capabilities of LLMs to be incorporated into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway for transferring reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we use the merged models to understand the internal mechanisms of perception and reasoning and how merging affects them. We find that perception…
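
The abstract does not spell out the merging rule, so the following is only a minimal sketch of one common form of parameter merging: linear interpolation of the weights that a VLM and a reasoning LLM share by name. The function name `merge_linear`, the weight `alpha`, and the toy parameter names are hypothetical and are not taken from the paper or its repository. Parameters are plain lists of floats here to keep the example self-contained.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def merge_linear(vlm: Params, llm: Params, alpha: float = 0.9) -> Params:
    """Interpolate parameters shared by name; copy VLM-only parameters as-is.

    alpha is the weight on the VLM (assumption: 0 <= alpha <= 1).
    Parameters that exist only in the LLM are ignored, since the merged
    model keeps the VLM's architecture.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    merged: Params = {}
    for name, vlm_weights in vlm.items():
        if name in llm:
            llm_weights = llm[name]
            if len(llm_weights) != len(vlm_weights):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            merged[name] = [
                alpha * v + (1.0 - alpha) * r
                for v, r in zip(vlm_weights, llm_weights)
            ]
        else:
            # e.g. vision encoder or projector weights with no LLM counterpart
            merged[name] = list(vlm_weights)
    return merged


if __name__ == "__main__":
    vlm = {"layer0.w": [1.0, 2.0], "vision.proj": [5.0]}
    llm = {"layer0.w": [3.0, 4.0]}
    print(merge_linear(vlm, llm, alpha=0.5))
```

No gradient step is involved, which is what "training-free" refers to: the merged weights are computed directly from the two sets of existing weights. In this sketch, parameters with no counterpart in the LLM (for example, vision-side components) are simply carried over from the VLM.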

Code & Models

Repositories

shiqichen17/vlm_merging
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Explainable Artificial Intelligence (XAI)

Methods: Focus