DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning

Junbo Wang; Liangyu Fu; Yuke Li; Yining Zhu; Ya Jing; Xuecheng Wu; Jiangbin Zheng

arXiv:2604.08084·cs.CV·April 10, 2026

TL;DR

DiffVC is a non-autoregressive, diffusion-based framework for video captioning. Parallel decoding speeds up generation and avoids cumulative error, and a discriminative conditional diffusion model is used to produce higher-quality descriptions.

Contribution

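The paper proposes a diffusion-based, non-autoregressive approach to video captioning. It targets two limitations of existing methods: the slow generation and cumulative error of autoregressive decoders, and the lower generation quality of earlier non-autoregressive methods.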

Findings

01 Outperforms previous non-autoregressive methods

02 Achieves comparable performance to autoregressive models

03 Faster caption generation with improved metrics

Abstract

Current video captioning methods usually use an encoder-decoder structure to generate text autoregressively. However, autoregressive methods have inherent limitations such as slow generation speed and large cumulative error. Furthermore, the few non-autoregressive counterparts suffer from deficiencies in generation quality due to the lack of sufficient multimodal interaction modeling. Therefore, we propose a non-autoregressive framework based on Diffusion model for Video Captioning (DiffVC) to address these issues. Its parallel decoding can effectively solve the problems of generation speed and cumulative error. At the same time, our proposed discriminative conditional Diffusion Model can generate higher-quality textual descriptions. Specifically, we first encode the video into a visual representation. During training, Gaussian noise is added to the textual representation of the…
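
The training step mentioned at the end of the abstract (adding Gaussian noise to a textual representation) can be illustrated with the standard DDPM forward process. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract is truncated and does not specify the noise schedule, so the linear schedule, the step count, the embedding shape, and all function names below are hypothetical.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule (a common DDPM default, not from the paper)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta_t), usually written alpha_bar_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    # One independent standard-normal draw per embedding entry.
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in x0]
    x_t = [
        [signal * v + sigma * e for v, e in zip(row, eps_row)]
        for row, eps_row in zip(x0, eps)
    ]
    return x_t, eps


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

# Stand-in for a caption's text embedding: 5 tokens, 8 dimensions each (assumed shape).
x0 = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
x_t, eps = add_noise(x0, 500, alpha_bars, rng)
```

Every token position is noised at once, which matches the parallel, non-autoregressive setting described above: a denoiser conditioned on the visual representation would then be trained to recover the clean text representation (or the noise) from `x_t`.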
