LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

Zebin You; Shen Nie; Xiaolu Zhang; Jun Hu; Jun Zhou; Zhiwu Lu; Ji-Rong Wen; Chongxuan Li

arXiv:2505.16933 · cs.LG · June 5, 2025

TL;DR

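LLaDA-V is a purely diffusion-based multimodal large language model that combines visual instruction tuning with masked diffusion. It reports state-of-the-art multimodal understanding among hybrid autoregressive-diffusion and purely diffusion-based MLLMs, even though its underlying language model is weaker on text-only tasks than LLaMA3-8B and Qwen2-7B.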

Contribution

This work presents LLaDA-V, a purely diffusion-based multimodal large language model that combines visual instruction tuning with masked diffusion models, departing from the autoregressive paradigm that dominates current multimodal approaches.
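As a rough illustration of the masked diffusion idea mentioned above, the sketch below shows one forward (noising) step: each response token is independently replaced by a mask token with probability t, while the prompt, including an image placeholder, is left untouched. This follows the general recipe of masked diffusion language models; the token strings, the `MASK` symbol, the function name `mask_response`, and the choice to noise only the response are illustrative assumptions, not details taken from this page.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration only


def mask_response(prompt, response, t, rng):
    """Replace each response token with MASK independently with probability t.

    The prompt (text and image placeholder) is returned unchanged; only the
    response part is noised. This is an assumed setup for the sketch.
    """
    noised = [MASK if rng.random() < t else tok for tok in response]
    return prompt + noised


rng = random.Random(0)
prompt = ["<image>", "What", "is", "shown", "?"]  # toy prompt tokens
response = ["A", "cat", "on", "a", "mat"]         # toy response tokens

# t close to 0 keeps the response nearly clean; t close to 1 masks almost all of it.
noisy = mask_response(prompt, response, t=0.5, rng=rng)
print(noisy)
```

During training, a model of this kind would be asked to predict the original tokens at the masked positions; at sampling time the process runs in reverse, starting from a fully masked response and filling in tokens over several steps.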

Findings

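01

LLaDA-V delivers promising multimodal performance even though its language model is weaker on text-only tasks than LLaMA3-8B and Qwen2-7B.

02

Trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks and scales better with data; it also narrows the gap with Qwen2-VL.

03

It achieves state-of-the-art multimodal understanding among existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs.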

Abstract

In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and an MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results. First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks and shows better data scalability. It also narrows the…
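The connector described in the abstract can be pictured with a small sketch: visual features from the vision encoder are passed through an MLP so that they have the same width as the language model's token embeddings, and the projected "visual tokens" are then placed alongside text embeddings in one sequence. The two-layer shape, the GELU activation, the toy dimensions, and all function names below are assumptions made for illustration; the page does not specify them.

```python
import math
import random


def gelu(x):
    """Exact GELU activation (assumed here; the page does not name one)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(in_dim, out_dim, rng):
    """Return (weight, bias) with small random weights and zero bias."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def connect(visual_features, layer1, layer2):
    """Project each visual feature vector into the language embedding width."""
    projected = []
    for feat in visual_features:
        hidden = [gelu(h) for h in linear(feat, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected


rng = random.Random(0)
VISION_DIM, TEXT_DIM = 4, 6  # toy sizes; real models use far larger widths

layer1 = init_linear(VISION_DIM, TEXT_DIM, rng)
layer2 = init_linear(TEXT_DIM, TEXT_DIM, rng)

# Three stand-in patch features, as if produced by a vision encoder.
patches = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(3)]
visual_tokens = connect(patches, layer1, layer2)

# Two stand-in text embeddings; both kinds of token now share one width.
text_tokens = [[0.0] * TEXT_DIM for _ in range(2)]
sequence = visual_tokens + text_tokens
print(len(sequence), len(sequence[0]))
```

The point of the projection is only that visual and text tokens end up in a common space, so the language model can attend over them as a single sequence.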

Code & Models

Models

GSAI-ML/LLaDA-V (Hugging Face model · 4.6k downloads · 26 likes)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion