VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding
Zhihao He, Tieyuan Chen, Kangyu Wang, Ziran Qin, Yang Shao, Chaofan Gan, Shijie Li, Zuxuan Wu, Weiyao Lin

TL;DR
VidLaDA introduces a bidirectional diffusion-based Video LLM that enhances spatiotemporal understanding and enables parallel decoding, significantly improving efficiency over traditional autoregressive models.
Contribution
The paper proposes VidLaDA, a diffusion language model for video understanding that uses bidirectional attention, and introduces MARS-Cache, an acceleration strategy for efficient parallel decoding.
Findings
Achieves performance comparable to state-of-the-art autoregressive (AR) models.
Outperforms existing diffusion-based models in video understanding.
MARS-Cache speeds up decoding by over 12 times without accuracy loss.
Abstract
Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation. However, this AR paradigm inevitably faces a dual efficiency bottleneck: strictly unidirectional attention compromises understanding efficiency by hindering global spatiotemporal aggregation, while serial decoding restricts generation efficiency. To address this, we propose VidLaDA, a Video LLM based on Diffusion Language Models (DLMs) that leverages bidirectional attention to unlock comprehensive spatiotemporal modeling and decode tokens in parallel. To further mitigate the computational overhead of diffusion decoding, we introduce MARS-Cache, an acceleration strategy that prunes redundancy by combining asynchronous visual cache refreshing with frame-wise chunk attention. Experiments show VidLaDA rivals state-of-the-art AR…
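The attention patterns contrasted in the abstract can be illustrated with boolean masks. The sketch below is only an illustration under stated assumptions, not the paper's implementation: the function names, the token layout (all visual tokens first, then text tokens), and the rule that visual tokens attend only within their own frame while text tokens attend to everything are hypothetical choices made here to show what a frame-wise chunk pattern could look like.

```python
import numpy as np

def causal_mask(n):
    # AR-style unidirectional attention: token i sees only tokens 0..i.
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n):
    # Full bidirectional attention: every token sees every other token.
    return np.ones((n, n), dtype=bool)

def frame_chunk_mask(num_frames, tokens_per_frame, num_text):
    # Hypothetical frame-wise chunk pattern (an assumption, not the paper's spec):
    # visual tokens attend only within their own frame,
    # text tokens attend to all visual and text tokens.
    n_vis = num_frames * tokens_per_frame
    n = n_vis + num_text
    mask = np.zeros((n, n), dtype=bool)
    for f in range(num_frames):
        start = f * tokens_per_frame
        end = start + tokens_per_frame
        mask[start:end, start:end] = True  # block-diagonal visual attention
    mask[n_vis:, :] = True  # text queries see the whole sequence
    return mask

if __name__ == "__main__":
    # Count the allowed query-key pairs for a toy sequence:
    # 2 frames of 3 tokens each, followed by 2 text tokens.
    n = 2 * 3 + 2
    print("causal:", int(causal_mask(n).sum()))
    print("bidirectional:", int(bidirectional_mask(n).sum()))
    print("frame-chunk:", int(frame_chunk_mask(2, 3, 2).sum()))
```

Counting the allowed pairs shows the trade-off in this toy setup: the causal mask blocks every query from seeing later tokens, the bidirectional mask allows all pairs, and the chunked mask drops cross-frame visual pairs to reduce attention cost.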
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
