COVR: Collaborative Optimization of VLMs and RL Agent for Visual-Based Control

Canming Xia; Peixi Peng; Guang Tan; Zhan Su; Haoran Xu; Zhenxian Liu; Luntong Li

arXiv:2601.06122 · cs.CV · January 13, 2026

TL;DR

COVR introduces a collaborative framework that mutually enhances vision-language models and reinforcement learning policies through RL-generated data, improving sample efficiency and performance in visual control tasks.

Contribution

The paper proposes a novel collaborative optimization framework that jointly fine-tunes VLMs and RL policies, incorporating new modules for efficient training and mutual enhancement.

Findings

1. COVR outperforms existing methods on various visual control benchmarks.
2. The proposed modules improve training stability and exploration efficiency.
3. Mutual enhancement leads to better semantic reasoning and policy learning.

Abstract

Visual reinforcement learning (RL) suffers from poor sample efficiency due to high-dimensional observations in complex tasks. While existing works have shown that vision-language models (VLMs) can assist RL, they often focus on knowledge distillation from the VLM to RL, overlooking the potential of RL-generated interaction data to enhance the VLM. To address this, we propose COVR, a collaborative optimization framework that enables the mutual enhancement of the VLM and RL policies. Specifically, COVR fine-tunes the VLM with RL-generated data to enhance the semantic reasoning ability consistent with the target task, and uses the enhanced VLM to further guide policy learning via action priors. To improve fine-tuning efficiency, we introduce two key modules: (1) an Exploration-Driven Dynamic Filter module that preserves valuable exploration samples using adaptive thresholds based on the…
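
The abstract says the enhanced VLM guides policy learning via action priors but, as excerpted here, does not give the form of that guidance. The following is a minimal sketch of one common way such guidance can be written, assuming a discrete action space and a KL penalty that pulls the policy toward the prior. The function names, the `beta` weight, and the KL formulation are all illustrative assumptions, not the paper's method.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_to_prior(policy_probs, prior_probs, eps=1e-8):
    """Per-sample KL(policy || prior) over discrete actions."""
    p = np.clip(policy_probs, eps, 1.0)
    q = np.clip(prior_probs, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def prior_guided_loss(rl_loss, policy_logits, prior_probs, beta=0.1):
    """Hypothetical combined objective: RL loss plus a weighted KL penalty.

    rl_loss:       scalar loss from any RL algorithm (assumed given)
    policy_logits: array of shape (batch, num_actions) from the policy
    prior_probs:   array of shape (batch, num_actions), standing in for
                   the VLM's action prior (assumed to be a distribution)
    beta:          assumed weight on the prior term
    """
    kl = kl_to_prior(softmax(policy_logits), prior_probs)
    return rl_loss + beta * float(np.mean(kl))
```

When the policy already matches the prior, the penalty vanishes and the objective reduces to the plain RL loss; the further the policy drifts from the prior, the larger the added term.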

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning