Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding

Wenhui Tan; Ruihua Song; Jiaze Li; Jianzhong Ju; Zhenbo Luo

arXiv:2601.11359 · cs.CV · January 19, 2026

TL;DR

Think-Clip-Sample (TCS) is a training-free framework that improves long video understanding by adaptively balancing dense local details and sparse global context. Across different MLLMs it raises accuracy by up to 6.9% and can reach comparable accuracy with 50% less inference time.

Contribution

TCS introduces a novel clip-level slow-fast sampling method combined with multi-query reasoning, enhancing long video comprehension without additional training.

Findings

01 Boosts accuracy by up to 6.9% on MLVU, LongVideoBench, and VideoMME

02 Reaches comparable accuracy with 50% less inference time

03 Effective across different multi-modal large language models (MLLMs)

Abstract

Recent progress in multi-modal large language models (MLLMs) has significantly advanced video understanding. However, their performance on long-form videos remains limited by computational constraints and suboptimal frame selection. We present Think-Clip-Sample (TCS), a training-free framework that enhances long video understanding through two key components: (i) Multi-Query Reasoning, which generates multiple queries to capture complementary aspects of the question and video; and (ii) Clip-level Slow-Fast Sampling, which adaptively balances dense local details and sparse global context. Extensive experiments on MLVU, LongVideoBench, and VideoMME demonstrate that TCS consistently improves performance across different MLLMs, boosting accuracy by up to 6.9%, and can achieve comparable accuracy with 50% less inference time, highlighting both the efficiency and efficacy of TCS on…
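
The abstract names the two components but does not spell out how they are computed, so the following is only a rough sketch under stated assumptions, not the authors' implementation. It assumes each query has already produced one relevance score per frame, fuses queries with a per-frame maximum, treats runs of frames above a threshold as clips, and splits a frame budget between dense sampling inside clips ("slow") and sparse uniform sampling over the whole video ("fast"). All function names, the max fusion, `threshold`, and `slow_ratio` are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def fuse_query_scores(per_query_scores: Sequence[Sequence[float]]) -> List[float]:
    """Assumed fusion rule: per-frame maximum over all queries."""
    return [max(frame_scores) for frame_scores in zip(*per_query_scores)]


def find_clips(scores: Sequence[float], threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for runs with score >= threshold."""
    clips, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            clips.append((start, i))
            start = None
    if start is not None:
        clips.append((start, len(scores)))
    return clips


def evenly_spaced(start: int, end: int, k: int) -> List[int]:
    """Pick up to k evenly spaced frame indices from range(start, end)."""
    n = end - start
    if k <= 0 or n <= 0:
        return []
    k = min(k, n)
    return [start + (i * n) // k for i in range(k)]


def slow_fast_sample(
    scores: Sequence[float],
    budget: int,
    threshold: float = 0.5,   # hypothetical relevance cutoff
    slow_ratio: float = 0.75,  # hypothetical share of the budget spent inside clips
) -> List[int]:
    """Select at most `budget` frame indices: dense in clips, sparse elsewhere."""
    clips = find_clips(scores, threshold)
    slow_budget = int(budget * slow_ratio) if clips else 0
    total_clip_len = sum(e - s for s, e in clips)
    chosen = set()
    # Slow pass: split the slow budget across clips in proportion to clip length.
    for s, e in clips:
        share = slow_budget * (e - s) // total_clip_len
        chosen.update(evenly_spaced(s, e, share))
    # Fast pass: spend what is left on uniform coverage of the whole video.
    chosen.update(evenly_spaced(0, len(scores), budget - len(chosen)))
    return sorted(chosen)
```

With two queries that highlight different moments of a 12-frame video, `slow_fast_sample(fuse_query_scores([q1, q2]), budget=6)` concentrates most of the selected frames inside the two high-scoring runs while still keeping a few frames spread across the rest of the video for global context.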

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition