VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou

TL;DR
VideoLLM-MoD introduces a mixture-of-depths approach that selectively skips computation for most vision tokens in each transformer layer, significantly reducing resource usage while maintaining or improving performance in long-term video understanding tasks.
Contribution
The paper proposes a novel skip-layer technique inspired by mixture-of-depths LLMs to efficiently handle dense vision tokens in streaming video without sacrificing accuracy.
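The skip-layer idea can be illustrated with a minimal sketch. This is not the authors' implementation: the names `mod_layer`, `router_weights`, `layer_fn`, and `keep_ratio` are hypothetical, the router is assumed to be a simple dot-product score, and the layer is modeled as a per-token function instead of full attention. Under those assumptions, text tokens and the top-scoring vision tokens are processed by the layer, while the remaining vision tokens pass through unchanged.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mod_layer(
    tokens: Sequence[Vector],
    is_vision: Sequence[bool],
    router_weights: Vector,
    layer_fn: Callable[[Vector], Vector],
    keep_ratio: float = 0.2,
) -> List[Vector]:
    """Run layer_fn on text tokens and on the top-scoring vision tokens only.

    All names and the default keep_ratio are illustrative assumptions.
    """
    # Indices of the vision tokens in the sequence.
    vision_idx = [i for i, flag in enumerate(is_vision) if flag]

    # Assumed router: a dot product between each vision token and a weight vector.
    scores = {
        i: sum(w * x for w, x in zip(router_weights, tokens[i]))
        for i in vision_idx
    }

    # Keep a fixed fraction of vision tokens (at least one when any exist).
    k = max(1, int(keep_ratio * len(vision_idx))) if vision_idx else 0
    selected = set(sorted(vision_idx, key=lambda i: scores[i], reverse=True)[:k])

    out: List[Vector] = []
    for i, tok in enumerate(tokens):
        if not is_vision[i] or i in selected:
            # Processed token: layer update added through the residual connection.
            update = layer_fn(tok)
            out.append([x + u for x, u in zip(tok, update)])
        else:
            # Skipped vision token: passes through the layer unchanged.
            out.append(list(tok))
    return out


# Toy usage: four vision tokens followed by one text token.
toy_tokens = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
toy_out = mod_layer(
    toy_tokens,
    is_vision=[True, True, True, True, False],
    router_weights=[1.0, 0.0],
    layer_fn=lambda t: [1.0 for _ in t],
    keep_ratio=0.25,
)
```

In this toy run only one of the four vision tokens goes through `layer_fn`, which is where the compute saving would come from; the token count itself is not reduced, so skipped tokens remain available to later layers.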
Findings
Saves approximately 42% of training time and 30% of training memory.
Maintains or improves performance on multiple video understanding benchmarks.
Demonstrates state-of-the-art results in narration, forecasting, and summarization tasks.
Abstract
A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while increasing the number of vision tokens generally enhances visual understanding, it also significantly raises memory and computational costs, especially in long-term, dense video frame streaming scenarios. Although learnable approaches like Q-Former and Perceiver Resampler have been developed to reduce the vision token burden, they overlook the context causally modeled by LLMs (i.e., key-value cache), potentially leading to missed visual cues when addressing user queries. In this paper, we introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.…
Taxonomy
Topics: Advanced Data Compression Techniques · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Attention Is All You Need · Linear Layer · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Absolute Position Encodings
