VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making
Jake Grigsby, Yuke Zhu, Michael Ryoo, Juan Carlos Niebles

TL;DR
This paper introduces VLM Q-Learning, a reinforcement learning method for fine-tuning vision-language models on interactive decision-making tasks. It improves their ability to follow strict output syntax requirements and to learn from experience.
Contribution
It presents an off-policy RL approach that adapts open-weight VLMs to agent tasks, combining the stability of supervised fine-tuning with the ability to self-improve.
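One common way to combine supervised fine-tuning with off-policy RL is to keep the usual log-likelihood loss but apply it only to actions that a learned critic rates as better than average. The sketch below illustrates that general idea in plain Python; it is not taken from the paper, and the function names (`filtered_sft_loss`, `td_target`), the positive-advantage filter, and the discount value are all assumptions made for illustration.

```python
def td_target(reward, next_value, done, gamma=0.99):
    """One-step temporal-difference target used to train a critic (assumed form)."""
    return reward + (0.0 if done else gamma * next_value)


def filtered_sft_loss(log_probs, q_values, state_values):
    """Mean negative log-likelihood over actions with positive estimated advantage.

    log_probs:    log-probability the model assigns to each dataset action
    q_values:     critic estimate Q(s, a) for each dataset action
    state_values: critic estimate V(s) for each state
    All three are hypothetical inputs for this sketch.
    """
    kept = []
    for logp, q, v in zip(log_probs, q_values, state_values):
        advantage = q - v
        if advantage > 0:  # imitate only actions judged better than average
            kept.append(-logp)
    if not kept:
        return 0.0  # nothing passed the filter, so there is nothing to imitate
    return sum(kept) / len(kept)
```

With a filter like this, a dataset of mixed quality can still be used: poor decisions stay in the data to train the critic but are excluded from the imitation term.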
Findings
VLM Q-Learning improves task-specific performance of VLMs.
The method enables learning from low-quality datasets.
It demonstrates success across multiple multi-modal domains.
Abstract
Recent research looks to harness the general knowledge and reasoning of large language models (LLMs) into agents that accomplish user-specified goals in interactive environments. Vision-language models (VLMs) extend LLMs to multi-modal data and provide agents with the visual reasoning necessary for new applications in areas such as computer automation. However, agent tasks emphasize skills where accessible open-weight VLMs lag behind their LLM equivalents. For example, VLMs are less capable of following an environment's strict output syntax requirements and are more focused on open-ended question answering. Overcoming these limitations requires supervised fine-tuning (SFT) on task-specific expert demonstrations. Our work approaches these challenges from an offline-to-online reinforcement learning (RL) perspective. RL lets us fine-tune VLMs to agent tasks while learning from the…
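To make the "strict output syntax" problem concrete, the sketch below shows how an environment might reject any model output that does not match its action format exactly. The action names (`click`, `type`, `scroll`) and the `name(argument)` grammar are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical action grammar: an action name followed by one parenthesized argument.
ACTION_PATTERN = re.compile(r"^(click|type|scroll)\((.*)\)$")


def parse_action(text):
    """Return (name, argument) if text is a valid action, otherwise None."""
    match = ACTION_PATTERN.match(text.strip())
    if match is None:
        return None  # free-form answers are rejected by the environment
    name, argument = match.groups()
    return name, argument
```

Under this assumed grammar, a conversational reply such as "I think I should click button 12" yields no valid action, which is the kind of failure a model tuned for open-ended question answering tends to produce.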
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Shrink and Fine-Tune
