BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

Baoyou Chen; Hanchen Xia; Peng Tu; Haojun Shi; Liwei Zhang; Weihao Yuan; Siyu Zhu

arXiv:2604.16514 · cs.CV · April 28, 2026

TL;DR

BARD is a bridging framework that converts a pretrained autoregressive vision-language model into a same-architecture diffusion model, speeding up decoding while preserving quality through progressive block merging and stage-wise distillation.

Contribution

The paper combines progressive supervised block merging with stage-wise intra-dVLM distillation to convert pretrained autoregressive VLMs into diffusion VLMs that decode faster while recovering the quality lost at larger block sizes.

Findings

01. BARD achieves up to 3× decoding speedup over the source autoregressive model.

02. It establishes new state-of-the-art results among comparable-scale open diffusion VLMs.

03. Stage-wise intra-diffusion distillation effectively recovers performance lost at larger decoding blocks.
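
For intuition about where a decoding speedup like the one in the first finding can come from, the following back-of-the-envelope sketch counts model forward passes under an idealized cost model. It is not taken from the paper: the function names, the assumption that every block needs a fixed number of denoising steps, and the example numbers are all illustrative, and real speedups also depend on per-pass cost, caching, and early stopping.

```python
import math


def ar_forward_passes(num_tokens: int) -> int:
    """Token-by-token decoding: one forward pass per generated token."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return num_tokens


def block_diffusion_forward_passes(
    num_tokens: int, block_size: int, steps_per_block: int
) -> int:
    """Idealized block-wise diffusion decoding (an assumption, not the paper's model).

    The sequence is split into blocks of `block_size` tokens, and each block
    is denoised with `steps_per_block` forward passes.
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if block_size <= 0 or steps_per_block <= 0:
        raise ValueError("block_size and steps_per_block must be positive")
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * steps_per_block


def idealized_speedup(num_tokens: int, block_size: int, steps_per_block: int) -> float:
    """Ratio of forward passes, treating every pass as equally expensive."""
    diffusion = block_diffusion_forward_passes(num_tokens, block_size, steps_per_block)
    return ar_forward_passes(num_tokens) / diffusion


# Hypothetical numbers: 256 tokens, blocks of 8 tokens, 4 denoising steps per block.
print(block_diffusion_forward_passes(256, 8, 4))  # 32 blocks * 4 steps = 128
print(idealized_speedup(256, 8, 4))               # 256 / 128 = 2.0
```

Under this toy model the ratio is simply `block_size / steps_per_block` whenever the sequence length is a multiple of the block size. This suggests why larger blocks are attractive, provided quality can be preserved with few denoising steps.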

Abstract

Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same-architecture, decoding-efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly…
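
The two training ingredients named in the abstract can be pictured as a staged plan: the student's decoding block size grows from stage to stage, while every stage distills from the same fixed small-block anchor. The sketch below illustrates only that structure. The doubling schedule, the concrete block sizes, and the names `Stage` and `build_stage_plan` are hypothetical and do not come from the paper or its repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    student_block_size: int  # block size the student is trained to decode with
    teacher_block_size: int  # block size of the fixed small-block anchor


def build_stage_plan(
    anchor_block_size: int, target_block_size: int, growth: int = 2
) -> List[Stage]:
    """Return one Stage per block-size enlargement.

    Assumption for illustration: the block size is multiplied by `growth`
    at each stage and capped at `target_block_size`. The teacher stays at
    `anchor_block_size` for every stage.
    """
    if anchor_block_size < 1:
        raise ValueError("anchor_block_size must be at least 1")
    if target_block_size < anchor_block_size:
        raise ValueError("target_block_size must be >= anchor_block_size")
    if growth < 2:
        raise ValueError("growth must be at least 2")

    plan: List[Stage] = []
    size = anchor_block_size
    while size < target_block_size:
        size = min(size * growth, target_block_size)
        plan.append(Stage(student_block_size=size, teacher_block_size=anchor_block_size))
    return plan


# Hypothetical example: anchor at block size 4, final student at block size 32.
for stage in build_stage_plan(anchor_block_size=4, target_block_size=32):
    print(stage)
```

Keeping the teacher fixed, rather than distilling each stage from the previous one, mirrors the abstract's description of a "fixed small-block diffusion anchor". How the actual stages and losses are configured is not specified in the text above.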

Code & Models

Repositories

fudan-generative-vision/Bard-VL (GitHub)
