SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM

Ming Nie; Dan Ding; Chunwei Wang; Yuanfan Guo; Jianhua Han; Hang Xu; Li Zhang

arXiv:2602.03589 · cs.CV · February 4, 2026

TL;DR

SlowFocus is a mechanism for video LLMs (Vid-LLMs) that improves fine-grained temporal understanding by densely sampling the query-related segment of a video and aggregating those local features with global video context.

Contribution

The paper introduces SlowFocus, a new sampling and attention mechanism for Vid-LLMs, along with training strategies and a benchmark for fine-grained temporal understanding.

Findings

01. Outperforms existing models on public benchmarks.

02. Achieves superior temporal reasoning capabilities.

03. Enhances frame-level semantic retention while capturing global temporal context.

Abstract

Large language models (LLMs) have demonstrated exceptional capabilities in text understanding, which has paved the way for their expansion into video LLMs (Vid-LLMs) to analyze video data. However, current Vid-LLMs struggle to simultaneously retain high-quality frame-level semantic information (i.e., a sufficient number of tokens per frame) and comprehensive video-level temporal information (i.e., an adequate number of sampled frames per video). This limitation hinders the advancement of Vid-LLMs towards fine-grained video understanding. To address this issue, we introduce the SlowFocus mechanism, which significantly enhances the equivalent sampling frequency without compromising the quality of frame-level visual tokens. SlowFocus begins by identifying the query-related temporal segment based on the posed question, then performs dense sampling on this segment to extract local…
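
The sampling idea described in the abstract can be illustrated with a minimal sketch. It assumes the query-related segment is already known (the abstract says it is identified from the posed question, but this excerpt does not say how), and that both the sparse global pass and the dense local pass use uniform sampling. The function names, parameters, and frame counts below are hypothetical and are not taken from the paper.

```python
def uniform_indices(start, end, count):
    """Return `count` frame indices spread evenly over [start, end)."""
    if count <= 0 or end <= start:
        return []
    step = (end - start) / count
    # Take the midpoint of each of the `count` equal-width bins.
    return [min(end - 1, int(start + step * (i + 0.5))) for i in range(count)]


def mixed_frequency_indices(num_frames, segment, n_global, n_local):
    """Combine sparse whole-video sampling with dense in-segment sampling.

    `segment` is a (start, end) frame range assumed to be the part of the
    video relevant to the question; how it is found is outside this sketch.
    """
    seg_start = max(0, segment[0])
    seg_end = min(num_frames, segment[1])
    global_idx = uniform_indices(0, num_frames, n_global)   # global context
    local_idx = uniform_indices(seg_start, seg_end, n_local)  # local detail
    # Merge both sets, dropping duplicates and restoring temporal order.
    return sorted(set(global_idx) | set(local_idx))


if __name__ == "__main__":
    frames = mixed_frequency_indices(1000, (400, 500), n_global=8, n_local=16)
    print(frames)
```

With these illustrative numbers, the 100-frame segment receives 16 samples while the full 1000-frame video receives 8, so the segment is sampled roughly 20 times more densely than the rest of the video while the total stays at no more than 24 frames.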

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning