RLLaVA: An RL-central Framework for Language and Vision Assistants
Lei Zhao, Zihao Ma, Boyu Lin, Yuhe Liu, Wenjun Wu, Lei Huang

TL;DR
RLLaVA introduces a flexible, resource-efficient RL framework for training large-scale language and vision models, enabling easy integration of RL algorithms and improving task performance.
Contribution
It presents a modular RL framework that decouples algorithm logic from architecture, supporting scalable training of large vision-language models on common GPUs.
Findings
Supports training 1B–7B models on standard GPUs
Models trained with RLLaVA outperform base models
Demonstrates task extensibility and competitive performance
Abstract
We present an RL-central framework for Language and Vision Assistants (RLLaVA), together with its Markov decision process (MDP) formulation. RLLaVA decouples RL algorithmic logic from model architecture and distributed execution, so that researchers can implement new RL algorithms with minimal code and plug in a broad family of RL methods and vision-language models (VLMs) while remaining agnostic to specific training and inference engines. RLLaVA makes resource-efficient training of 1B–7B models feasible on common GPUs; notably, 4B-scale models can be trained end-to-end with full-parameter updates on a single 24GB GPU. Experiments on multi-modal and agentic tasks demonstrate that RLLaVA extends across tasks, and that models trained with it consistently improve over their base models while remaining competitive with other specially engineered RL frameworks. The code is available at…
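To make the decoupling idea concrete, here is a minimal Python sketch of how algorithm logic could sit behind a small registry, independent of any model or execution engine. Everything in it (`ADVANTAGE_FNS`, `register`, `compute_advantages`, and the two estimator names) is a hypothetical illustration and is not taken from the RLLaVA codebase; the estimators are generic baseline-subtraction schemes chosen only as examples.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping an algorithm name to its advantage estimator.
# These names are illustrative assumptions, not RLLaVA's actual API.
ADVANTAGE_FNS: Dict[str, Callable[[List[float]], List[float]]] = {}


def register(name: str):
    """Decorator that adds an advantage estimator to the registry."""
    def decorator(fn):
        ADVANTAGE_FNS[name] = fn
        return fn
    return decorator


@register("mean_baseline")
def mean_baseline(rewards: List[float]) -> List[float]:
    """Subtract the group mean reward from each sampled response's reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


@register("group_norm")
def group_normalized(rewards: List[float]) -> List[float]:
    """Center and scale rewards within one group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # The small epsilon avoids division by zero when all rewards are equal.
    return [(r - mean) / (std + 1e-6) for r in rewards]


def compute_advantages(algo: str, rewards: List[float]) -> List[float]:
    """Model-agnostic entry point: it only sees scalar rewards."""
    return ADVANTAGE_FNS[algo](rewards)
```

Under this assumed design, adding a new algorithm means registering one more function; the code that generates responses or updates the model would not need to change.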
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Speech and dialogue systems
