Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames

Anurag Arnab; Ahmet Iscen; Mathilde Caron; Alireza Fathi; Cordelia Schmid

arXiv:2507.02001·cs.LG·July 4, 2025

TL;DR

The paper introduces Temporal Chain of Thought, an inference strategy for video question-answering in which the VLM itself iteratively selects the most relevant frames before answering. It reports state-of-the-art results on 4 video question-answering datasets, with the largest gains on very long videos.

Contribution

The paper proposes an inference-time method that uses the VLM itself to select the frames relevant to a question, so the model answers from a curated context rather than from every frame that fits in its context window.

Findings

01 State-of-the-art results on 4 video QA datasets

02 Significant improvement on videos longer than 1 hour

03 Effective frame selection enhances accuracy with limited context

Abstract

Despite recent advances in Vision-Language Models (VLMs), long-video understanding remains a challenging problem. Although state-of-the-art long-context VLMs can process around 1000 input frames, they still struggle to effectively leverage this sequence length, and succumb to irrelevant distractors within the context window. We present Temporal Chain of Thought, an inference strategy for video question-answering that curates the model's input context. We use the VLM itself to iteratively identify and extract the most relevant frames from the video, which are then used for answering. We demonstrate how leveraging more computation at inference-time to select the most relevant context leads to improvements in accuracy, in agreement with recent work on inference-time scaling of LLMs. Moreover, we achieve state-of-the-art results on 4 diverse video question-answering datasets, showing…
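
The select-then-answer idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the algorithm's details, so the chunking scheme, the fallback to uniform sampling, and every name here (`select_fn`, `answer_fn`, `context_size`, `max_rounds`) are assumptions made for illustration. The two callables stand in for calls to a VLM.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces (assumptions, not from the paper):
# select_fn(frame_indices, question) -> indices the VLM judges relevant
# answer_fn(frame_indices, question) -> the VLM's answer string
SelectFn = Callable[[Sequence[int], str], List[int]]
AnswerFn = Callable[[Sequence[int], str], str]


def uniform_sample(indices: Sequence[int], k: int) -> List[int]:
    """Return at most k evenly spaced items from indices."""
    n = len(indices)
    if n <= k:
        return list(indices)
    step = n / k
    return [indices[int(i * step)] for i in range(k)]


def select_relevant_frames(
    num_frames: int,
    question: str,
    select_fn: SelectFn,
    context_size: int,
    max_rounds: int = 3,
) -> List[int]:
    """Iteratively narrow the video down to frames that fit in one context."""
    candidates = list(range(num_frames))
    for _ in range(max_rounds):
        if len(candidates) <= context_size:
            break  # everything already fits in a single context window
        kept: List[int] = []
        # Show the selector one context-sized chunk of candidates at a time.
        for start in range(0, len(candidates), context_size):
            chunk = candidates[start:start + context_size]
            kept.extend(select_fn(chunk, question))
        kept = sorted(set(kept))
        if not kept or len(kept) == len(candidates):
            break  # selector made no progress; keep current candidates
        candidates = kept
    # Assumed fallback: subsample uniformly if too many candidates remain.
    return uniform_sample(candidates, context_size)


def answer_question(
    num_frames: int,
    question: str,
    select_fn: SelectFn,
    answer_fn: AnswerFn,
    context_size: int,
) -> str:
    """Curate the context first, then answer from the selected frames only."""
    frames = select_relevant_frames(num_frames, question, select_fn, context_size)
    return answer_fn(frames, question)
```

With a toy `select_fn` that keeps only the indices in a short window of a 10,000-frame video, `select_relevant_frames` returns exactly that window. With a selector that returns nothing, it falls back to `context_size` uniformly spaced frames. The extra selector calls are where the additional inference-time computation is spent.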
