LaViDa: A Large Diffusion Language Model for Multimodal Understanding

Shufan Li; Konstantinos Kallidromitis; Hritik Bansal; Akash Gokul; Yusuke Kato; Kazuki Kozuka; Jason Kuen; Zhe Lin; Kai-Wei Chang; Aditya Grover

arXiv:2505.16839·cs.CV·June 19, 2025

TL;DR

LaViDa is a diffusion-based vision-language model that offers faster inference, controllable generation, and bidirectional reasoning, and that outperforms autoregressive VLMs on benchmarks such as MMMU and COCO captioning.

Contribution

This paper presents LaViDa, a family of VLMs built on discrete diffusion models, with techniques for effective training, efficient inference, and improved multimodal understanding.

Findings

01. Outperforms AR VLMs on the MMMU benchmark

02. Surpasses Open-LLaVa-Next-8B on COCO captioning by +4.1 CIDEr

03. Achieves a +59% improvement on Constrained Poem Completion

Abstract

Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text-infilling. While effective in language-only settings, DMs' potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tune the combined parts for multimodal instruction following. To address challenges encountered, LaViDa incorporates novel techniques such as…
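The parallel decoding and text-infilling ideas mentioned in the abstract can be illustrated with a toy sketch. This is not LaViDa's algorithm or code: `toy_predict` is a hypothetical stand-in that returns random tokens and confidences where a real model would run a bidirectional denoiser, and the unmasking schedule (commit the most confident proposals at each step) is an assumption chosen for simplicity. The sketch only shows the shape of the loop: several masked positions are filled per step, and tokens fixed in the template are never changed.

```python
import random

MASK = "<mask>"

def toy_predict(seq, vocab, rng):
    """Hypothetical denoiser stand-in: propose (token, confidence) for each masked slot.

    A real diffusion LM would condition on the whole sequence (both left and
    right context); here the proposals are random, purely for illustration.
    """
    return {i: (rng.choice(vocab), rng.random())
            for i, tok in enumerate(seq) if tok == MASK}

def parallel_infill(template, vocab, steps=4, seed=0):
    """Fill all MASK slots in at most `steps` rounds, several slots per round."""
    rng = random.Random(seed)
    seq = list(template)
    for step in range(steps):
        proposals = toy_predict(seq, vocab, rng)
        if not proposals:
            break
        # Spread the remaining masked slots evenly over the remaining rounds
        # (ceiling division), so the final round always finishes the sequence.
        k = -(-len(proposals) // (steps - step))
        most_confident = sorted(proposals, key=lambda i: proposals[i][1],
                                reverse=True)[:k]
        for i in most_confident:
            seq[i] = proposals[i][0]
    return seq

# Infilling: the unmasked words act as format constraints the output must keep.
template = ["Roses", MASK, MASK, "red", MASK, MASK, MASK, "blue"]
vocab = ["are", "is", "so", "very", "and", "violets"]
print(parallel_infill(template, vocab, steps=3))
```

Because five masked slots are filled in three rounds rather than five, the number of model calls is decoupled from the output length, which is the source of the speedup claimed for parallel decoding.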

Code & Models

Repositories

jacklishufan/lavida
PyTorch · Official

Models

🤗 KonstantinosKK/lavida-llada-v1.0-instruct-hf-transformers
model · 2.1k downloads

Taxonomy

Methods: Diffusion