VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

Shiwei Wu; Joya Chen; Kevin Qinghong Lin; Qimeng Wang; Yan Gao; Qianli Xu; Tong Xu; Yao Hu; Enhong Chen; Mike Zheng Shou

arXiv:2408.16730 · cs.CV · August 30, 2024 · 2 cites

TL;DR

VideoLLM-MoD introduces a mixture-of-depths approach that selectively skips computation for most vision tokens in each transformer layer, significantly reducing resource usage while maintaining or improving performance in long-term video understanding tasks.

Contribution

The paper proposes a novel skip-layer technique inspired by mixture-of-depths LLMs to efficiently handle dense vision tokens in streaming video without sacrificing accuracy.
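The skip-layer idea can be illustrated with a minimal sketch. This is not the authors' implementation: the linear router `router_w`, the `keep_ratio` value, and the toy `layer_fn` standing in for a transformer block are all assumptions made for illustration. The sketch only shows the routing pattern, in which text tokens and the highest-scoring vision tokens pass through the layer while the remaining vision tokens are carried forward unchanged.

```python
import numpy as np

def mod_layer(hidden, is_vision, router_w, layer_fn, keep_ratio=0.25):
    """Apply layer_fn to text tokens and the top-scoring vision tokens only.

    hidden:     (seq_len, dim) token states
    is_vision:  (seq_len,) boolean mask marking vision tokens
    router_w:   (dim,) weights of a hypothetical linear router
    layer_fn:   callable (n, dim) -> (n, dim), a stand-in for a transformer block
    keep_ratio: assumed fraction of vision tokens processed by this layer
    """
    vision_idx = np.flatnonzero(is_vision)
    n_keep = max(1, int(round(keep_ratio * len(vision_idx)))) if len(vision_idx) else 0

    # Score each vision token and pick the n_keep highest-scoring ones.
    scores = hidden[vision_idx] @ router_w
    top_vision = vision_idx[np.argsort(-scores)[:n_keep]]

    # Text tokens are always processed; only selected vision tokens join them.
    process = ~is_vision
    process[top_vision] = True

    # Skipped tokens keep their input state (identity path through the layer).
    out = hidden.copy()
    out[process] = hidden[process] + layer_fn(hidden[process])
    return out, process

# Toy usage: 8 vision tokens followed by 2 text tokens.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(10, 4))
is_vision = np.array([True] * 8 + [False] * 2)
router_w = rng.normal(size=4)
out, processed = mod_layer(hidden, is_vision, router_w, lambda x: 0.1 * x)
```

With these toy settings the layer touches 4 of the 10 tokens (2 text tokens plus 2 of the 8 vision tokens), which is where the compute reduction would come from. The sequence length is unchanged, so skipped tokens remain available to later layers.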

Findings

1. Achieves approximately 42% time and 30% memory savings during training.

2. Maintains or improves performance on multiple video understanding benchmarks.

3. Demonstrates state-of-the-art results in narration, forecasting, and summarization tasks.

Abstract

A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while increasing the number of vision tokens generally enhances visual understanding, it also significantly raises memory and computational costs, especially in long-term, dense video frame streaming scenarios. Although learnable approaches like Q-Former and Perceiver Resampler have been developed to reduce the vision token burden, they overlook the context causally modeled by LLMs (i.e., key-value cache), potentially leading to missed visual cues when addressing user queries. In this paper, we introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.…

Videos

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation · SlidesLive

Taxonomy

Topics: Advanced Data Compression Techniques · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Attention Is All You Need · Linear Layer · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Absolute Position Encodings