VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making

Jake Grigsby; Yuke Zhu; Michael Ryoo; Juan Carlos Niebles

arXiv:2505.03181 · cs.LG · May 7, 2025

Open Access

TL;DR

This paper introduces VLM Q-Learning, a method that fine-tunes vision-language models for interactive decision-making with reinforcement learning, improving their ability to follow an environment's strict output syntax and to learn from experience.

Contribution

The paper presents an off-policy RL approach that improves open-weight VLMs on agent tasks, combining the stability of supervised fine-tuning with the ability to self-improve.

Findings

1. VLM Q-Learning improves task-specific performance of VLMs.
2. The method enables learning from low-quality datasets.
3. It demonstrates success across multiple multi-modal domains.

Abstract

Recent research looks to harness the general knowledge and reasoning of large language models (LLMs) into agents that accomplish user-specified goals in interactive environments. Vision-language models (VLMs) extend LLMs to multi-modal data and provide agents with the visual reasoning necessary for new applications in areas such as computer automation. However, agent tasks emphasize skills where accessible open-weight VLMs lag behind their LLM equivalents. For example, VLMs are less capable of following an environment's strict output syntax requirements and are more focused on open-ended question answering. Overcoming these limitations requires supervised fine-tuning (SFT) on task-specific expert demonstrations. Our work approaches these challenges from an offline-to-online reinforcement learning (RL) perspective. RL lets us fine-tune VLMs to agent tasks while learning from the…

Taxonomy

Topics: Semantic Web and Ontologies

Methods: Shrink and Fine-Tune