Transferring Textual Preferences to Vision-Language Understanding through Model Merging

Chen-An Li; Tzu-Han Lin; Yun-Nung Chen; Hung-yi Lee

arXiv:2502.13487 · cs.CL · May 23, 2025

TL;DR

This paper introduces a training-free method that merges text-based reward models into vision-language models, improving their ability to evaluate generated content without additional training.

Contribution

The paper proposes a model-merging technique that integrates text-based reward models with large vision-language models (LVLMs), avoiding the costly training of vision-language reward models.

Findings

1. Merged models outperform standalone LVLMs in evaluation tasks.
2. The approach is computationally efficient and training-free.
3. Merging improves alignment with textual preferences in multimodal tasks.

Abstract

Large vision-language models (LVLMs) perform strongly across a variety of multimodal tasks. However, their ability to evaluate generated content remains limited, and training vision-language reward models (VLRMs) on preference data is computationally expensive. This paper explores a training-free alternative: merging text-based reward models (RMs) with LVLMs to create VLRMs. The merged models outperform both LVLM-based scoring and text-based RMs, offering an efficient way to incorporate textual preferences into LVLMs.
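
The abstract does not spell out the merging procedure, so the following is only a minimal sketch of what merging two checkpoints can look like. It assumes the simplest option, linear interpolation of the parameters both models share, and copies over the parameters that exist in only one model (for example a vision encoder from the LVLM or a reward head from the text RM). The function name, the `weight` parameter, and the parameter names in the toy state dicts are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# A toy "state dict": parameter name -> flat list of weights.
# Real checkpoints hold tensors; plain lists keep the sketch dependency-free.
StateDict = Dict[str, List[float]]


def merge_state_dicts(lvlm: StateDict, text_rm: StateDict,
                      weight: float = 0.5) -> StateDict:
    """Interpolate parameters present in both models; copy the rest.

    `weight` is the share given to the LVLM for shared parameters
    (an assumed hyperparameter, not a value from the paper).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")

    merged: StateDict = {}
    for name in lvlm.keys() | text_rm.keys():
        if name in lvlm and name in text_rm:
            a, b = lvlm[name], text_rm[name]
            if len(a) != len(b):
                raise ValueError(f"size mismatch for parameter {name!r}")
            # Shared parameter: weighted average of the two models.
            merged[name] = [weight * x + (1.0 - weight) * y
                            for x, y in zip(a, b)]
        elif name in lvlm:
            # Only in the LVLM (e.g. vision encoder): keep as-is.
            merged[name] = list(lvlm[name])
        else:
            # Only in the text RM (e.g. reward head): keep as-is.
            merged[name] = list(text_rm[name])
    return merged


# Hypothetical parameter names, for illustration only.
lvlm = {"vision_encoder.w": [1.0, 2.0], "lm.layer0.w": [0.0, 4.0]}
text_rm = {"lm.layer0.w": [2.0, 0.0], "reward_head.w": [0.5]}

merged = merge_state_dicts(lvlm, text_rm, weight=0.5)
```

In this toy example the shared language-model layer is averaged, while the vision encoder and reward head pass through unchanged, so the merged model keeps the LVLM's visual input path and the RM's scoring output. Interpolation only makes sense when the shared parameters come from models with the same architecture and a common starting point, which is an assumption of this sketch.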

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Speech and dialogue systems