TL;DR
DiffusionVL converts pretrained autoregressive (AR) vision language models into diffusion models, reaching strong performance with less training data and faster inference, and narrowing the gap between AR and diffusion approaches.
Contribution
The paper presents an efficient diffusion finetuning process that transforms existing AR models into diffusion vision language models without changing the backbone architecture.
Findings
DiffusionVL achieves a 34.4% improvement on the MMMU-Pro benchmark.
It attains a 37.5% gain on the MME benchmark.
The method doubles inference speed compared to prior approaches.
Abstract
Diffusion-based decoding has recently emerged as an appealing alternative to autoregressive (AR) generation, offering the potential to update multiple tokens in parallel and reduce latency. However, diffusion vision language models (dVLMs) still lag significantly behind mainstream autoregressive vision language models. This is due to the scarcity and weaker performance of base diffusion language models (dLLMs) compared with their autoregressive counterparts. This raises a natural question: Can we build high-performing dVLMs directly from existing powerful AR models, without relying on dLLMs? We propose DiffusionVL, a family of dVLMs obtained by translating pretrained AR models into the diffusion paradigm via an efficient diffusion finetuning procedure that changes the training objective and decoding process while keeping the backbone architecture intact. Through an efficient diffusion…
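To make "changes the training objective ... while keeping the backbone architecture intact" concrete, here is a minimal sketch of a generic masked-diffusion training loss, assuming the common formulation in which tokens are masked independently with probability t and cross-entropy is taken only on masked positions, weighted by 1/t. This is an illustration, not necessarily the objective DiffusionVL uses. The names `MASK_ID`, `mask_tokens`, `masked_diffusion_loss`, and `uniform_predictor` are hypothetical, and the toy predictor only stands in for an unchanged AR backbone.

```python
import math
import random

MASK_ID = -1  # hypothetical id for the [MASK] token (assumption)


def mask_tokens(tokens, t, rng):
    """Forward (noising) step: mask each token independently with probability t."""
    return [MASK_ID if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(tokens, noisy, predict_probs, t):
    """Cross-entropy on masked positions only, weighted by 1/t.

    predict_probs(noisy) stands in for the backbone; it returns one
    {token: probability} dict per position.
    """
    probs = predict_probs(noisy)
    loss = 0.0
    for tok, noisy_tok, p in zip(tokens, noisy, probs):
        if noisy_tok == MASK_ID:  # unmasked positions contribute nothing
            loss += -math.log(p[tok])
    return loss / (t * len(tokens))


def uniform_predictor(vocab_size):
    """Toy stand-in for a model: predicts a uniform distribution everywhere."""
    def predict(noisy):
        return [{v: 1.0 / vocab_size for v in range(vocab_size)} for _ in noisy]
    return predict


rng = random.Random(0)
tokens = [0, 1, 2, 3, 2, 1]
t = 1.0  # fully masked; in training t would be sampled from (0, 1]
noisy = mask_tokens(tokens, t, rng)
loss = masked_diffusion_loss(tokens, noisy, uniform_predictor(4), t)
```

With t = 1.0 every position is masked, so the uniform predictor's loss equals log of the vocabulary size. In this formulation only the noising step and the loss differ from AR training; the network producing the per-position distributions can stay the same.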
