Transferring Textual Preferences to Vision-Language Understanding through Model Merging
Chen-An Li, Tzu-Han Lin, Yun-Nung Chen, Hung-yi Lee

TL;DR
This paper introduces a training-free method that merges text-based reward models (RMs) into large vision-language models (LVLMs), giving the merged models a better ability to evaluate generated content without any additional training.
Contribution
The paper proposes a model merging technique that integrates text-based reward models with LVLMs, avoiding the costly preference-data training otherwise needed to build vision-language reward models (VLRMs).
Findings
Merged models outperform both LVLM scoring and text-based RMs when evaluating generated content
The approach is training-free, so it avoids the compute cost of training a VLRM on preference data
Merging transfers textual preferences from the RM into the LVLM for multimodal tasks
Abstract
Large vision-language models (LVLMs) perform outstandingly across various multimodal tasks. However, their ability to evaluate generated content remains limited, and training vision-language reward models (VLRMs) with preference data is computationally expensive. This paper explores a training-free alternative by merging text-based reward models (RMs) with LVLMs to create VLRMs. Our approach shows that integrating these models leads to improved performance over LVLMs' scoring and text-based RMs, offering an efficient method for incorporating textual preferences into LVLMs.
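The summary above does not say which merging rule the authors use, so the following is only a minimal sketch of one common option, linear interpolation of parameters. It assumes the LVLM and the text-based RM were fine-tuned from the same base language model, so that parameters with the same name have the same shape. The function name `merge_weights`, the parameter names, and the mixing coefficient `alpha` are hypothetical and chosen for illustration.

```python
from typing import Dict, List

# A toy stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def merge_weights(lvlm: Weights, text_rm: Weights, alpha: float = 0.5) -> Weights:
    """Linearly interpolate the parameters that both models share.

    Assumption (not from the source): parameters found only in the LVLM
    (e.g. a vision encoder) are copied unchanged, and parameters found only
    in the text RM (e.g. a reward head) are carried over as-is.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")

    merged: Weights = {}
    for name, w in lvlm.items():
        if name in text_rm:
            r = text_rm[name]
            if len(r) != len(w):
                raise ValueError(f"shape mismatch for {name}")
            # Shared language-model parameter: blend the two versions.
            merged[name] = [(1 - alpha) * a + alpha * b for a, b in zip(w, r)]
        else:
            # LVLM-only parameter: keep it untouched.
            merged[name] = list(w)

    for name, r in text_rm.items():
        if name not in lvlm:
            # RM-only parameter: carry it over.
            merged[name] = list(r)
    return merged


if __name__ == "__main__":
    lvlm = {"vision.proj": [1.0, 2.0], "lm.layer0": [0.0, 4.0]}
    text_rm = {"lm.layer0": [2.0, 0.0], "reward_head": [0.5]}
    print(merge_weights(lvlm, text_rm, alpha=0.5))
```

No gradient step appears anywhere in this sketch, which is what makes this style of merging training-free: the only cost is one pass over the two checkpoints.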
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Speech and dialogue systems
