Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback
Daechul Ahn, Yura Choi, Youngjae Yu, Dongyeop Kang, Jonghyun Choi

TL;DR
This paper introduces VLM-RLAIF, a novel reinforcement learning approach that uses AI-generated feedback to improve the alignment of video and text in large multimodal models, outperforming previous methods.
Contribution
The paper proposes VLM-RLAIF, an alignment strategy based on Reinforcement Learning from AI Feedback (RLAIF) in which the model supplies its own preference feedback, combined with context-aware reward modeling for better video-text alignment in multimodal models.
Findings
VLM-RLAIF outperforms existing models on diverse benchmarks.
Self-preference feedback enhances multimodal alignment.
Open-sourcing promotes further research.
Abstract
Recent advancements in large language models have influenced the development of video large multimodal models (VLMMs). Previous approaches for VLMMs involved Supervised Fine-Tuning (SFT) with instruction-tuned datasets, integrating an LLM with visual encoders, and adding learnable modules. Video and text multimodal alignment remains challenging, primarily because multimodal instruction-tuning data is deficient in volume and quality compared to text-only data. We present a novel alignment strategy, called Reinforcement Learning from AI Feedback (RLAIF), that employs a multimodal AI system to oversee itself, providing self-preference feedback to refine itself and facilitating the alignment of video and text modalities. Specifically, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback in order to…
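The context-aware preference step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt template, the `PreferencePair` container, and the function names are hypothetical, and the Bradley-Terry style pairwise loss is the standard choice for reward modeling in RLHF-style pipelines rather than something the truncated abstract confirms.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """Hypothetical container for one AI-labeled preference example."""
    prompt: str    # judge prompt, including the video description as context
    chosen: str    # response the AI judge preferred
    rejected: str  # response the AI judge did not prefer


def build_judge_prompt(question: str, video_description: str,
                       response_a: str, response_b: str) -> str:
    """Assemble a judge prompt; the template wording is an assumption."""
    return (
        f"Video description (context): {video_description}\n"
        f"Question: {question}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response is better grounded in the video? Answer A or B."
    )


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).

    Computed as softplus(-margin) in a numerically stable form.
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    prompt = build_judge_prompt(
        question="What is the person doing?",
        video_description="A person slices vegetables on a cutting board.",
        response_a="The person is chopping vegetables.",
        response_b="The person is riding a bicycle.",
    )
    pair = PreferencePair(prompt=prompt,
                          chosen="The person is chopping vegetables.",
                          rejected="The person is riding a bicycle.")
    # The loss is smaller when the reward model scores the chosen response higher.
    print(pairwise_reward_loss(2.0, 0.0) < pairwise_reward_loss(0.0, 2.0))
```

In this sketch the video description enters only through the judge prompt, which is the sense in which the feedback is "context-aware"; how the actual system injects context into the reward model is not recoverable from the truncated abstract.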
Taxonomy
Topics: Neural Networks and Reservoir Computing · Reinforcement Learning in Robotics
Methods: Reinforcement Learning from AI Feedback · Shrink and Fine-Tune
