Test-Time Temporal Sampling for Efficient MLLM Video Understanding
Kaibin Wang, Mingbao Lin

TL;DR
This paper introduces T3S, a training-free inference method that efficiently processes long videos with multimodal large language models by sampling multiple short subsequences and aggregating predictions, improving speed and accuracy.
Contribution
T3S is a novel, inference-time, plug-and-play approach that reduces computational complexity and enhances long-video understanding without model retraining or fine-tuning.
Findings
Improves accuracy by up to 3.1% on long-video understanding benchmarks.
Reduces inference latency by a factor of 2.04.
Operates with minimal integration effort.
Abstract
Processing long videos with multimodal large language models (MLLMs) poses a significant computational challenge, as the model's self-attention mechanism scales quadratically with the number of video tokens, resulting in high computational demand and slow inference speed. Current solutions, such as rule-based sub-sampling, learned frame selectors, or memory-based summarization, often introduce their own trade-offs: they compromise accuracy, necessitate additional training, or decrease inference speed. In this paper, we propose Test-Time Temporal Sampling (T3S), a training-free, plug-and-play inference wrapper that enables MLLMs to process long videos both efficiently and effectively. T3S exploits spatiotemporal redundancy by generating multiple short and diverse subsequences of video tokens at inference time, packing them within a single forward pass, and aggregating their predictions.…
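The sample-then-aggregate idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, uniform random sampling of frame indices, and majority-vote aggregation are all assumptions made for illustration, and the `model` argument is a hypothetical stand-in for an MLLM call. The sketch also queries the model once per subsequence, whereas the paper packs the subsequences into a single forward pass. The `attention_cost` helper only illustrates the quadratic-scaling argument, assuming attention is confined to each subsequence.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def sample_subsequences(num_frames: int, num_subseqs: int, subseq_len: int,
                        rng: random.Random) -> List[List[int]]:
    """Draw several short, temporally ordered subsets of frame indices.

    Uniform sampling without replacement is an assumption of this sketch;
    the paper's actual sampling scheme may differ.
    """
    if subseq_len > num_frames:
        raise ValueError("subseq_len cannot exceed num_frames")
    return [sorted(rng.sample(range(num_frames), subseq_len))
            for _ in range(num_subseqs)]


def aggregate_by_vote(answers: Sequence[str]) -> str:
    """Majority vote over per-subsequence answers (assumed aggregation rule).

    Ties resolve to the answer that was encountered first.
    """
    return Counter(answers).most_common(1)[0][0]


def attention_cost(num_tokens: int) -> int:
    """Pairwise token interactions in full self-attention, proportional to n^2."""
    return num_tokens ** 2


def sampled_answer(num_frames: int,
                   model: Callable[[List[int]], str],
                   num_subseqs: int = 4,
                   subseq_len: int = 8,
                   seed: int = 0) -> str:
    """Query a (hypothetical) model on each subsequence and aggregate.

    A sequential loop is used for clarity; it does not reproduce the
    single-forward-pass packing described in the abstract.
    """
    rng = random.Random(seed)
    subseqs = sample_subsequences(num_frames, num_subseqs, subseq_len, rng)
    answers = [model(frames) for frames in subseqs]
    return aggregate_by_vote(answers)
```

Under the stated assumption that tokens attend only within their own subsequence, four subsequences of 256 tokens cost 4 × 256² = 262,144 pairwise interactions, against 1024² = 1,048,576 for one 1024-token sequence, a fourfold reduction. This arithmetic covers attention cost only and says nothing about the end-to-end latency reported in the findings.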
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Machine Learning in Healthcare
