LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models

Shufan Li; Yuchen Zhu; Jiuxiang Gu; Kangning Liu; Zhe Lin; Yongxin Chen; Molei Tao; Aditya Grover; Jason Kuen

arXiv:2602.14147 · cs.CV · February 17, 2026

TL;DR

LaViDa-R1 is a multimodal diffusion language model for general-purpose reasoning. It is post-trained with a single framework that covers both understanding and generation tasks, and it performs strongly on visual math reasoning, reason-intensive grounding, and image editing.

Contribution

The paper introduces a unified post-training framework that combines supervised fine-tuning (SFT) and multi-task reinforcement learning (RL) for multimodal reasoning.

Findings

01 Strong performance on visual math reasoning

02 Effective in reason-intensive grounding tasks

03 Excels in image editing applications

Abstract

Diffusion language models (dLLMs) have recently emerged as a promising alternative to autoregressive LLMs. Recent work has further extended them to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a multimodal, general-purpose reasoning dLLM. Unlike existing works that build reasoning dLLMs through task-specific reinforcement learning, LaViDa-R1 incorporates diverse multimodal understanding and generation tasks in a unified manner. In particular, LaViDa-R1 is built with a novel unified post-training framework that seamlessly integrates supervised fine-tuning (SFT) and multi-task reinforcement learning (RL). It employs several novel training techniques, including answer-forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability. Extensive experiments demonstrate LaViDa-R1's strong performance on a wide range of…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks