Can a Unimodal Language Agent Provide Preferences to Tune a Multimodal Vision-Language Model?

Sazia Tabasum Mim; Jack Morris; Manish Dhakal; Yanming Xiu; Maria Gorlatova; Yi Ding

arXiv:2601.06424·cs.CL·January 13, 2026

Open Access

TL;DR

This paper investigates whether a text-only language model can guide a multimodal vision-language model through preference feedback, so that the vision-language model's descriptions convey multimodal content more usefully to the language model.

Contribution

The paper introduces a method by which a unimodal language model gives feedback to a vision-language model, improving the quality of the multimodal descriptions it generates.

Findings

01 VLM description quality improved with LLM feedback

02 Up to 13% absolute accuracy increase on multimodal tasks

03 Human preferences aligned with LLM feedback at 64.6%

Abstract

To explore a more scalable path for adding multimodal capabilities to existing LLMs, this paper addresses a fundamental question: Can a unimodal LLM, relying solely on text, reason about its own informational needs and provide effective feedback to optimize a multimodal model? To answer this, we propose a method that enables a language agent to give feedback to a vision-language model (VLM) so that the VLM adapts its text generation to the agent's preferences. Results across our experiments affirm this hypothesis, showing that LLM preference feedback significantly enhances VLM descriptions. Using our proposed method, we find that the VLM can generate multimodal scene descriptions that help the LLM better understand multimodal context, leading to improvements of up to 13% in absolute accuracy compared to the baseline multimodal approach. Furthermore, a human study validated our AI-driven feedback,…
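The feedback loop the abstract describes can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: `toy_text_only_judge` is a hypothetical word-overlap stand-in for the text-only agent, `build_preference_pair` and `PreferencePair` are names invented here, and the prompt/chosen/rejected layout is simply a common convention for preference-tuning data rather than something the abstract specifies.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    """One preference record: the better and worse description for a prompt."""
    prompt: str
    chosen: str
    rejected: str


def toy_text_only_judge(question: str, description: str) -> float:
    """Hypothetical stand-in for the text-only agent.

    Scores a description by the fraction of question words it contains.
    A real agent would be an LLM that reads only text; this is a toy proxy.
    """
    q_words = {w.strip("?.,").lower() for w in question.split()}
    d_words = {w.strip("?.,").lower() for w in description.split()}
    q_words.discard("")
    return len(q_words & d_words) / max(len(q_words), 1)


def build_preference_pair(
    question: str,
    candidates: List[str],
    judge: Callable[[str, str], float],
) -> PreferencePair:
    """Rank candidate descriptions with a text-only judge.

    Returns the highest-scored candidate as `chosen` and the
    lowest-scored candidate as `rejected`.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidate descriptions")
    ranked = sorted(candidates, key=lambda d: judge(question, d), reverse=True)
    return PreferencePair(prompt=question, chosen=ranked[0], rejected=ranked[-1])


# Example: two made-up candidate descriptions of the same scene.
question = "What color is the mug on the desk?"
candidates = [
    "A room with furniture.",
    "A red mug sits on a wooden desk next to a laptop.",
]
pair = build_preference_pair(question, candidates, toy_text_only_judge)
print(pair.chosen)
```

The judge never sees the image, only the candidate text, which mirrors the question the paper poses: whether text-only judgments of this kind carry enough signal to tune the description model.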

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems