Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success
George Bredis, Stanislav Dereka, Viacheslav Sinii, Ruslan Rakhimov, Daniil Gavrilov

TL;DR
This paper introduces VL-DAC, a simple, hyperparameter-free reinforcement learning algorithm that trains vision-language models in synthetic environments, leading to improved real-world task performance without sacrificing image understanding.
Contribution
The paper presents VL-DAC, a novel RL method that applies PPO updates at the action-token level while learning the value function only at the environment-step level, enabling effective training of VLMs in inexpensive synthetic worlds.
Findings
VL-DAC achieves significant generalization improvements across multiple benchmarks.
Training in synthetic environments does not degrade image understanding accuracy.
The method converges faster and more reliably than previous RL approaches.
Abstract
Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions -- a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning value only at the environment-step level: an arrangement, to our knowledge, not previously explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable…
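The decoupling described in the abstract (PPO updates on action tokens, value learning per environment step) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the `clip_eps` default, the squared-error value loss, and the choice to share one step-level advantage across all tokens emitted at that step are assumptions made purely for illustration.

```python
import numpy as np


def ppo_token_loss(new_logp, old_logp, step_advantages, step_ids, clip_eps=0.2):
    """Clipped PPO surrogate averaged over action tokens (illustrative sketch).

    new_logp, old_logp: shape (T,), per-token log-probabilities of the action tokens.
    step_advantages:    shape (S,), one advantage per environment step.
    step_ids:           shape (T,), index of the environment step each token belongs to.
    clip_eps:           standard PPO clip range (assumed value, not from the paper).
    """
    ratio = np.exp(new_logp - old_logp)
    # Assumption: every token of a step shares that step's advantage.
    adv = step_advantages[step_ids]
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))


def step_value_loss(step_values, step_returns):
    """Squared-error value loss with one prediction per environment step."""
    return 0.5 * np.mean((step_values - step_returns) ** 2)


# Hypothetical rollout: 2 environment steps, 3 action tokens in total.
old_logp = np.array([-1.0, -2.0, -0.5])
new_logp = old_logp.copy()            # policy unchanged, so every ratio is 1
step_ids = np.array([0, 0, 1])        # tokens 0 and 1 belong to step 0
step_advantages = np.array([1.0, -1.0])

policy_loss = ppo_token_loss(new_logp, old_logp, step_advantages, step_ids)
value_loss = step_value_loss(np.array([0.5, 0.0]), np.array([1.0, 0.0]))
```

In this sketch the two losses are computed at different granularities and never mixed inside a single per-token term, which is the sense in which the actor and critic updates are decoupled here. How the paper combines or schedules them is not specified in the text above.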
