TL;DR
BARD is a bridging framework that converts a pretrained autoregressive vision-language model into a same-architecture diffusion model, speeding up decoding while preserving quality through progressive block merging and stage-wise distillation.
Contribution
The paper combines progressive supervised block merging with stage-wise intra-dVLM distillation to convert a pretrained autoregressive VLM into a diffusion VLM that decodes faster and recovers the quality lost at larger block sizes.
Findings
BARD achieves up to 3× decoding speedup over the source autoregressive model.
It establishes new state-of-the-art results among comparable-scale open diffusion VLMs.
Stage-wise intra-diffusion distillation effectively recovers performance lost at larger decoding blocks.
Abstract
Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same-architecture, decoding-efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly…
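The two training ingredients named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The doubling rule in `stage_schedule` is an assumption, chosen only because the checkpoints listed on this page use block sizes 4, 8, 16, and 32. The KL-based `distillation_loss` is likewise an assumed form of distillation from a fixed small-block anchor, and all function names are hypothetical.

```python
import math


def stage_schedule(start_block=4, final_block=32):
    """Return the block size used at each stage.

    Assumption: the block size doubles per stage (4 -> 8 -> 16 -> 32).
    The abstract only says the block size is enlarged gradually.
    """
    stages = []
    block = start_block
    while block <= final_block:
        stages.append(block)
        block *= 2
    return stages


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(anchor_logits, student_logits, masked):
    """Mean KL(anchor || student) over masked positions only.

    anchor_logits:  per-position logits from the fixed small-block anchor.
    student_logits: per-position logits from the larger-block student.
    masked:         per-position booleans, True where the token is noised.
    Assumption: a KL objective; the abstract does not specify the loss.
    """
    losses = [
        kl_divergence(softmax(a), softmax(s))
        for a, s, is_masked in zip(anchor_logits, student_logits, masked)
        if is_masked
    ]
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    print(stage_schedule())
    anchor = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
    student = [[1.0, 1.0, -1.0], [0.1, 0.2, 0.3]]
    print(distillation_loss(anchor, student, [True, False]))
```

In this sketch the anchor stays fixed across stages while the student's block size grows, which mirrors the abstract's description of distilling from a fixed small-block diffusion anchor.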
Code & Models
- 🤗 fudan-generative-ai/Bard-VL-B4-Mask-8B-Instruct · model · 37 dl · ♡ 1
- 🤗 fudan-generative-ai/Bard-VL-B4-Mask-4B-Instruct · model · 87 dl · ♡ 1
- 🤗 fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct · model · 151 dl · ♡ 2
- 🤗 fudan-generative-ai/Bard-VL-B8-Mask-4B-Distil-Instruct · model · 77 dl · ♡ 1
- 🤗 fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct · model · 69 dl · ♡ 1
- 🤗 fudan-generative-ai/Bard-VL-B32-Mask-4B-Distil-Instruct · model · 76 dl · ♡ 1
