Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding
Wenhui Tan, Ruihua Song, Jiaze Li, Jianzhong Ju, Zhenbo Luo

TL;DR
Think-Clip-Sample (TCS) is a training-free framework that improves long video understanding by adaptively balancing dense local details and sparse global context. It raises accuracy by up to 6.9% across multiple MLLMs and can reach comparable accuracy with 50% less inference time.
Contribution
TCS introduces a novel clip-level slow-fast sampling method combined with multi-query reasoning, enhancing long video comprehension without additional training.
Findings
Boosts accuracy by up to 6.9% on MLVU, LongVideoBench, and VideoMME
Reaches comparable accuracy with 50% less inference time
Effective across various multi-modal large language models
Abstract
Recent progress in multi-modal large language models (MLLMs) has significantly advanced video understanding. However, their performance on long-form videos remains limited by computational constraints and suboptimal frame selection. We present Think-Clip-Sample (TCS), a training-free framework that enhances long video understanding through two key components: (i) Multi-Query Reasoning, which generates multiple queries to capture complementary aspects of the question and video; and (ii) Clip-level Slow-Fast Sampling, which adaptively balances dense local details and sparse global context. Extensive experiments on MLVU, LongVideoBench, and VideoMME demonstrate that TCS consistently improves performance across different MLLMs, boosting accuracy by up to 6.9%, and can achieve comparable accuracy with 50% less inference time, highlighting both the efficiency and efficacy of TCS on…
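The two components named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how frames are scored, how scores from multiple queries are combined, how clips are formed, or how the frame budget is divided. The sketch assumes per-frame relevance scores are already available for each query, fuses them with an element-wise maximum, ranks fixed-length clips by mean score, fills a "slow" share of the budget densely from the top clips, and spreads the remaining "fast" share uniformly over the rest of the video. The names `fuse_query_scores`, `slow_fast_sample`, `clip_len`, and `slow_ratio` are hypothetical.

```python
from typing import List, Sequence


def fuse_query_scores(per_query_scores: Sequence[Sequence[float]]) -> List[float]:
    """Combine per-frame scores from several queries (assumed rule: element-wise max)."""
    return [max(column) for column in zip(*per_query_scores)]


def slow_fast_sample(scores: Sequence[float], budget: int,
                     clip_len: int = 4, slow_ratio: float = 0.5) -> List[int]:
    """Return `budget` frame indices: dense frames from top clips plus sparse uniform frames."""
    n = len(scores)
    budget = min(budget, n)
    slow_budget = int(budget * slow_ratio)

    # Slow path: rank fixed-length clips by mean score and take their frames densely.
    clips = [range(start, min(start + clip_len, n)) for start in range(0, n, clip_len)]
    ranked = sorted(clips, key=lambda c: sum(scores[i] for i in c) / len(c), reverse=True)
    chosen: set = set()
    for clip in ranked:
        for i in clip:
            if len(chosen) >= slow_budget:
                break
            chosen.add(i)
        if len(chosen) >= slow_budget:
            break

    # Fast path: spread the remaining budget uniformly over frames not yet chosen.
    rest = [i for i in range(n) if i not in chosen]
    fast_budget = budget - len(chosen)
    if fast_budget > 0 and rest:
        step = len(rest) / fast_budget
        chosen.update(rest[int(k * step)] for k in range(fast_budget))

    return sorted(chosen)


# Usage: two hypothetical queries scoring a 16-frame video.
query_a = [0.0] * 8 + [1.0] * 4 + [0.0] * 4   # relevant around frames 8-11
query_b = [0.2] + [0.0] * 15                  # weakly relevant at frame 0
frames = slow_fast_sample(fuse_query_scores([query_a, query_b]), budget=8)
```

With half the budget on the slow path, the highest-scoring clip is covered frame by frame while the other half keeps coarse coverage of the whole video, which is the local-versus-global trade-off the abstract describes.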
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
