TL;DR
LaViDa is a diffusion-based vision-language model that offers faster inference, controllable generation, and bidirectional reasoning. It outperforms autoregressive VLMs on MMMU and on COCO captioning.
Contribution
This paper presents LaViDa, a family of VLMs built on discrete diffusion models, together with techniques for effective training, efficient inference, and improved multimodal understanding.
Findings
Outperforms AR VLMs on the MMMU benchmark
Surpasses Open-LLaVa-Next-8B on COCO captioning by +4.1 CIDEr
Achieves +59% improvement on Constrained Poem Completion
Abstract
Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text-infilling. While effective in language-only settings, DMs' potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tune the combined parts for multimodal instruction following. To address challenges encountered, LaViDa incorporates novel techniques such as…
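The parallel decoding and text-infilling ideas mentioned in the abstract can be illustrated with a toy sketch. This is not LaViDa's implementation: `toy_denoiser` is a hypothetical stand-in that proposes random tokens with random confidences, where a real masked diffusion model would score every masked position using the full bidirectional context (and, in a VLM, the image features). The names `MASK`, `toy_denoiser`, and `parallel_infill` are assumptions made for illustration only.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol for this sketch


def toy_denoiser(tokens, vocab, rng):
    """Stand-in for a model: propose (token, confidence) for each masked slot.

    A real denoiser would condition on all unmasked tokens, left and right.
    """
    return {
        i: (rng.choice(vocab), rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def parallel_infill(template, vocab, steps=2, seed=0):
    """Fill all masked slots in at most `steps` passes (requires steps >= 1).

    Each pass commits several positions at once, keeping the most
    confident proposals; fixed (unmasked) tokens are never changed.
    """
    rng = random.Random(seed)
    tokens = list(template)
    n_masked = sum(tok == MASK for tok in tokens)
    per_step = max(1, -(-n_masked // steps))  # ceiling division
    passes = 0
    while MASK in tokens:
        proposals = toy_denoiser(tokens, vocab, rng)
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:per_step]:
            tokens[i] = proposals[i][0]
        passes += 1
    return tokens, passes


template = ["Roses", MASK, MASK, MASK, "blue", MASK, MASK, "you"]
filled, passes = parallel_infill(template, ["are", "red", "and", "violets"])
print(filled, "in", passes, "passes")
```

Here five masked slots are filled in two passes instead of five sequential steps, and the fixed words ("Roses", "blue", "you") act as format constraints, which is the kind of infilling an autoregressive decoder cannot do directly.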
Taxonomy
Methods: Diffusion
