Unified Reinforcement and Imitation Learning for Vision-Language Models

Byung-Kwan Lee; Ryo Hachiuma; Yong Man Ro; Yu-Chiang Frank Wang; Yueh-Hua Wu

arXiv:2510.19307·cs.CV·October 23, 2025

TL;DR

This paper presents Unified Reinforcement and Imitation Learning (RIL), a training algorithm that combines reinforcement learning with adversarial imitation learning so that lightweight vision-language models achieve performance close to that of larger models.

Contribution

The paper introduces RIL, a new unified training method that integrates reinforcement and imitation learning to improve lightweight vision-language models.

Findings

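01 RIL narrows the performance gap between lightweight student models and state-of-the-art VLMs.

02 Student models trained with RIL outperform baseline methods on benchmarks.

03 RIL enables smaller student models to mimic large teacher models and to further improve their own text generation through reinforcement signals.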

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress, yet their large scale often renders them impractical for resource-constrained environments. This paper introduces Unified Reinforcement and Imitation Learning (RIL), a novel and efficient training algorithm designed to create powerful, lightweight VLMs. RIL distinctively combines the strengths of reinforcement learning with adversarial imitation learning. This enables smaller student VLMs not only to mimic the sophisticated text generation of large teacher models but also to systematically improve their generative capabilities through reinforcement signals. Key to our imitation framework is an LLM-based discriminator that adeptly distinguishes between student and teacher outputs, complemented by guidance from multiple large teacher VLMs to ensure diverse learning. This unified learning strategy, leveraging both…
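The abstract describes two signals: an adversarial one from a discriminator that separates student outputs from teacher outputs, and a reinforcement one. As a rough illustration only, the sketch below shows one common way such signals can be combined. It is not the paper's formulation, which the text above does not spell out: the function names, the sigmoid-based imitation reward, the scalar `task_reward`, and the `imitation_weight` mixing parameter are all assumptions made for this example, and the discriminator is reduced to precomputed logits rather than an actual LLM.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def discriminator_loss(teacher_logits, student_logits):
    """Binary cross-entropy for a discriminator (assumed setup).

    Teacher responses carry label 1 and student responses label 0;
    each logit is the discriminator's score for "teacher-written".
    """
    teacher_term = sum(softplus(-t) for t in teacher_logits) / len(teacher_logits)
    student_term = sum(softplus(s) for s in student_logits) / len(student_logits)
    return teacher_term + student_term


def imitation_reward(student_logit: float) -> float:
    """Reward in (0, 1): how teacher-like the discriminator finds a student response."""
    return sigmoid(student_logit)


def combined_reward(student_logit: float, task_reward: float,
                    imitation_weight: float = 0.5) -> float:
    """Hypothetical mix of the imitation reward and a separate task reward."""
    return (imitation_weight * imitation_reward(student_logit)
            + (1.0 - imitation_weight) * task_reward)
```

In this sketch the discriminator is trained to minimize `discriminator_loss`, while the student is rewarded both for responses the discriminator cannot tell apart from teacher responses and for whatever `task_reward` measures. For example, `combined_reward(0.0, 1.0)` gives 0.75: an undecided discriminator contributes 0.5, a full task reward contributes 1.0, and the two are weighted equally.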

Videos

Unified Reinforcement and Imitation Learning for Vision-Language Models · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications