LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation

Shentong Mo; Sukmin Yun

arXiv:2603.27693 · cs.CV · March 31, 2026

TL;DR

LVRPO introduces a reinforcement learning framework that explicitly aligns language and visual representations, improving multimodal understanding and generation without auxiliary encoders.

Contribution

LVRPO proposes a preference-driven reinforcement learning approach that uses GRPO to align language and visual representations explicitly in multimodal models.

Findings

01. Outperforms baseline models on multiple multimodal benchmarks.

02. Enhances both understanding and generation capabilities.

03. Does not require additional cross-modal loss functions.

Abstract

Unified multimodal pretraining has emerged as a promising paradigm for jointly modeling language and vision within a single foundation model. However, existing approaches largely rely on implicit or indirect alignment signals and remain suboptimal for simultaneously supporting multimodal understanding and generation, particularly in settings that require fine-grained language-visual reasoning and controllable generation. In this work, we propose LVRPO, a language-visual reinforcement-based preference optimization framework that explicitly aligns language and visual representations using Group Relative Policy Optimization (GRPO). Instead of introducing additional alignment losses at the representation level, LVRPO directly optimizes multimodal model behaviors through preference-driven reinforcement signals, encouraging consistent and semantically grounded interactions between language…
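
For readers unfamiliar with GRPO, the sketch below illustrates the group-relative advantage that gives the method its name: several responses are sampled for the same prompt, each is scored, and every score is normalized against the mean and standard deviation of its own group. This is a generic illustration and not the authors' implementation. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are all assumptions made for the example; the abstract does not specify how LVRPO computes its rewards.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar score per response sampled for the same
    prompt. `eps` is an assumed stabilizer that avoids division by zero
    when every response in the group receives the same score.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four responses sampled from one prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Responses that score above their group's mean receive a positive advantage and are reinforced, while those below the mean are discouraged. Because the baseline comes from the group itself, no separate learned value model is needed.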
