Test-Time Temporal Sampling for Efficient MLLM Video Understanding

Kaibin Wang; Mingbao Lin

arXiv:2511.17945 · cs.CV · November 25, 2025

Open Access

TL;DR

This paper introduces T3S, a training-free inference method that efficiently processes long videos with multimodal large language models by sampling multiple short subsequences and aggregating predictions, improving speed and accuracy.

Contribution

T3S is a novel, inference-time, plug-and-play approach that reduces computational complexity and enhances long-video understanding without model retraining or fine-tuning.

Findings

01. Improves accuracy by up to 3.1% on the evaluated benchmarks.

02. Reduces inference delay by a factor of 2.04.

03. Requires minimal integration effort.

Abstract

Processing long videos with multimodal large language models (MLLMs) poses a significant computational challenge: the cost of the model's self-attention mechanism scales quadratically with the number of video tokens, resulting in high computational demand and slow inference. Current solutions, such as rule-based sub-sampling, learned frame selectors, or memory-based summarization, often introduce their own trade-offs: they compromise accuracy, necessitate additional training, or decrease inference speed. In this paper, we propose Test-Time Temporal Sampling (T3S), a training-free, plug-and-play inference wrapper that enables MLLMs to process long videos both efficiently and effectively. T3S exploits spatiotemporal redundancy by generating multiple short and diverse subsequences of video tokens at inference time, packing them within a single forward pass, and aggregating their predictions.…
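The sampling-and-aggregation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the sampling distribution, the aggregation rule, or how packed subsequences interact. So the sketch assumes uniform random frame sampling, majority voting over per-subsequence answers, no attention across subsequences, and a fixed number of tokens per frame (so frame counts stand in for token counts). All function names are hypothetical.

```python
import random
from collections import Counter


def sample_subsequences(num_frames, num_subseqs, subseq_len, seed=0):
    """Draw `num_subseqs` index lists, each holding `subseq_len` distinct
    frame indices in temporal order (assumption: uniform random sampling)."""
    rng = random.Random(seed)
    return [
        sorted(rng.sample(range(num_frames), subseq_len))
        for _ in range(num_subseqs)
    ]


def attention_cost(num_tokens):
    """Quadratic self-attention cost model, up to a constant factor."""
    return num_tokens ** 2


def relative_cost(num_frames, num_subseqs, subseq_len):
    """Attention cost of the short subsequences relative to one full pass,
    assuming subsequences do not attend to each other."""
    return num_subseqs * attention_cost(subseq_len) / attention_cost(num_frames)


def aggregate_by_vote(answers):
    """Return the most frequent answer (assumption: majority voting)."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    subseqs = sample_subsequences(num_frames=64, num_subseqs=4, subseq_len=16)
    print(len(subseqs), "subsequences of", len(subseqs[0]), "frames each")
    print("relative attention cost:", relative_cost(64, 4, 16))
    # Hypothetical per-subsequence answers to a multiple-choice question.
    print("aggregated answer:", aggregate_by_vote(["B", "A", "B", "C"]))
```

Under this cost model, four subsequences of 16 frames drawn from a 64-frame video need 4 × 16² = 1024 units of attention against 64² = 4096 for the full sequence, a quarter of the cost. The speedup reported in the paper depends on factors this toy model ignores.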

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Machine Learning in Healthcare