RLLaVA: An RL-central Framework for Language and Vision Assistants

Lei Zhao; Zihao Ma; Boyu Lin; Yuhe Liu; Wenjun Wu; Lei Huang

arXiv:2512.21450·cs.LG·December 29, 2025

Open Access

TL;DR

RLLaVA is a flexible, resource-efficient RL framework for training vision-language models (VLMs) that makes RL algorithms easy to plug in and improves task performance over base models.

Contribution

The paper presents a modular RL framework that decouples algorithm logic from model architecture and distributed execution, supporting resource-efficient training of 1B to 7B vision-language models on common GPUs.

Findings

01. Supports training 1B to 7B models on common GPUs

02. Models trained with RLLaVA consistently outperform their base models

03. Extends across multi-modal and agentic tasks, with performance competitive with other specially engineered RL frameworks

Abstract

We present RLLaVA, an RL-central framework for Language and Vision Assistants, together with its Markov decision process (MDP) formulation. RLLaVA decouples RL algorithmic logic from model architecture and distributed execution, so researchers can implement new RL algorithms with minimal code and plug in a broad family of RL methods and vision-language models (VLMs) while remaining agnostic to specific training and inference engines. RLLaVA makes resource-efficient training of 1B to 7B models feasible on common GPUs; notably, 4B-scale models can be trained end-to-end with full-parameter updates on a single 24GB GPU. Experiments on multi-modal and agentic tasks demonstrate RLLaVA's task extensibility: models trained with it consistently improve over their base models and are competitive with other specially engineered RL frameworks. The code is available at…
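
To make the idea of decoupling algorithm logic from the execution engine concrete, here is a minimal Python sketch. It is not RLLaVA's actual API: every name in it (`RolloutEngine`, `Trajectory`, `collect_group`, `EchoEngine`) is hypothetical, and the group-normalized advantage is borrowed from GRPO-style methods purely as an example of "algorithm logic"; the abstract does not say which algorithms the framework ships with.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Sequence


@dataclass
class Trajectory:
    """One sampled response and its scalar reward (hypothetical container)."""
    prompt: str
    response: str
    reward: float


class RolloutEngine(Protocol):
    """Hypothetical interface: any backend that can sample n responses."""
    def generate(self, prompt: str, n: int) -> List[str]: ...


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Algorithm logic only: normalize rewards within one group of samples.

    GRPO-style assumption: advantage = (r - mean) / (std + eps).
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def collect_group(
    engine: RolloutEngine,
    reward_fn: Callable[[str, str], float],
    prompt: str,
    n: int,
) -> List[Trajectory]:
    """Glue code: depends on the engine interface, not on a concrete backend."""
    responses = engine.generate(prompt, n)
    return [Trajectory(prompt, r, reward_fn(prompt, r)) for r in responses]


class EchoEngine:
    """Toy stand-in for a real inference engine, used only for illustration."""
    def generate(self, prompt: str, n: int) -> List[str]:
        return [f"{prompt}-{i}" for i in range(n)]


if __name__ == "__main__":
    group = collect_group(EchoEngine(), lambda p, r: float(r.endswith("0")), "q", 4)
    print(group_relative_advantages([t.reward for t in group]))
```

In this sketch, swapping `EchoEngine` for another backend or replacing `group_relative_advantages` with a different estimator touches only one piece, which is the kind of separation the abstract describes.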

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Speech and dialogue systems