Towards Efficient Online Tuning of VLM Agents via Counterfactual Soft Reinforcement Learning

Lang Feng; Weihao Tan; Zhiyi Lyu; Longtao Zheng; Haiyang Xu; Ming Yan; Fei Huang; Bo An

arXiv:2505.03792 · cs.LG · June 4, 2025

Open Access

TL;DR

This paper introduces CoSo, a counterfactual soft reinforcement learning method for online fine-tuning of vision-language model agents. By focusing exploration on action-critical tokens, it improves exploration efficiency and task performance over prior methods.

Contribution

The paper proposes CoSo, a new RL approach that uses counterfactual reasoning to target important tokens in textual actions, enhancing online exploration for VLM agents.

Findings

01. CoSo improves exploration efficiency in diverse tasks.

02. It achieves consistent performance gains over prior methods.

03. Theoretical guarantees support CoSo's convergence and policy improvement.

Abstract

Online fine-tuning of vision-language model (VLM) agents with reinforcement learning (RL) has shown promise for equipping agents with multi-step, goal-oriented capabilities in dynamic environments. However, the agents' open-ended textual action space and the non-end-to-end nature of action generation pose significant challenges to effective online exploration in RL, e.g., explosion of the exploration space. We propose a novel online fine-tuning method, Counterfactual Soft Reinforcement Learning (CoSo), that is better suited to the textual output space of VLM agents. Compared to prior methods that assign uniform uncertainty to all tokens, CoSo leverages counterfactual reasoning to dynamically assess the causal influence of individual tokens on post-processed actions. By prioritizing the exploration of action-critical tokens while reducing the impact of semantically redundant or low-impact tokens, CoSo…
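
The idea of weighting per-token uncertainty by each token's influence on the post-processed action can be illustrated with a toy sketch. This is not the paper's estimator: the parser `parse_action`, the placeholder-substitution test in `causal_weights`, the binary 0/1 weights, and the whitespace-level "tokens" are all hypothetical simplifications chosen only to make the intuition concrete.

```python
import math
import re


def parse_action(text):
    """Hypothetical post-processing step: return the last `name(arg)` call in the text."""
    matches = re.findall(r"(\w+)\((\w*)\)", text)
    return matches[-1] if matches else None


def causal_weights(tokens, placeholder="<unk>"):
    """Toy counterfactual test (an assumption, not the paper's method):
    a token gets weight 1.0 if replacing it changes the parsed action, else 0.0."""
    base = parse_action(" ".join(tokens))
    weights = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + [placeholder] + tokens[i + 1:]
        changed = parse_action(" ".join(perturbed)) != base
        weights.append(1.0 if changed else 0.0)
    return weights


def entropy(probs):
    """Shannon entropy (in nats) of one token's output distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def weighted_entropy(token_dists, weights):
    """Entropy bonus in which each token's entropy is scaled by its causal weight."""
    return sum(w * entropy(d) for w, d in zip(weights, token_dists))


# Whitespace-separated "tokens" stand in for real subword tokens.
tokens = ["I", "should", "open", "settings", "click(3)"]
weights = causal_weights(tokens)

# Made-up per-token distributions: uniform over four candidates at every position.
dists = [[0.25] * 4 for _ in tokens]
uniform_bonus = weighted_entropy(dists, [1.0] * len(tokens))
targeted_bonus = weighted_entropy(dists, weights)
```

In this toy example only the final token affects the parsed action, so the targeted bonus rewards uncertainty at that position alone, whereas the uniform bonus spreads it over all five tokens.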

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Social Robot Interaction and HRI