Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding

Xinkui Zhao; Zuxin Wang; Yifan Zhang; Guanjie Cheng; Yueshen Xu; Shuiguang Deng; Chang Liu; Naibo Wang; Jianwei Yin

arXiv:2512.09354·cs.CV·December 11, 2025

Open Access

TL;DR

Video-QTR is a query-driven, adaptive framework for long-video understanding. Rather than encoding every frame, it focuses on the frames relevant to the semantic query, which reduces computational load while achieving state-of-the-art results.

Contribution

The paper proposes a lightweight, query-guided reasoning framework that dynamically allocates perceptual resources, improving efficiency and scalability in long-video comprehension.

Findings

01. Reduces input frame consumption by up to 73%.

02. Achieves state-of-the-art performance on five benchmarks.

03. Demonstrates effective adaptive resource allocation based on queries.

Abstract

The rapid development of multimodal large language models (MLLMs) has significantly expanded the scope of visual language reasoning, enabling unified systems to interpret and describe complex visual content. However, applying these models to long-video understanding remains computationally intensive. Dense frame encoding generates excessive visual tokens, leading to high memory consumption, redundant computation, and limited scalability in real-world applications. This inefficiency highlights a key limitation of the traditional process-then-reason paradigm, which exhaustively analyzes visual streams before any semantic reasoning takes place. To address this challenge, we introduce Video-QTR (Query-Driven Temporal Reasoning), a lightweight framework that redefines video comprehension as a query-guided reasoning process. Instead of encoding every frame, Video-QTR dynamically allocates perceptual…
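As a rough illustration of the general idea of query-guided frame selection described in the abstract, the sketch below scores each frame against a query and keeps only a small budget of the best matches. This is not the Video-QTR method, whose details are not given in this listing. It assumes that some encoder has already produced one embedding vector per frame and one for the query, and the names `cosine`, `select_frames`, and `budget` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_frames(
    query_vec: Sequence[float],
    frame_vecs: Sequence[Sequence[float]],
    budget: int,
) -> List[int]:
    """Return indices of the `budget` frames most similar to the query.

    Hypothetical helper: a generic top-k relevance filter, not the
    selection policy used by Video-QTR.
    """
    if budget <= 0:
        return []
    scores = [cosine(query_vec, f) for f in frame_vecs]
    # Rank frame indices from most to least similar to the query.
    ranked = sorted(range(len(frame_vecs)), key=lambda i: scores[i], reverse=True)
    # Keep the top `budget` frames, restored to temporal order.
    return sorted(ranked[:budget])


# Toy 2-D embeddings (made up for illustration only).
query = [1.0, 0.0]
frames = [[0.0, 1.0], [1.0, 0.1], [0.2, 1.0], [0.9, 0.0], [0.0, 0.5], [0.7, 0.7]]

kept = select_frames(query, frames, budget=2)
print(kept)  # [1, 3]
```

Only the selected frames would then be passed to the heavier visual encoder or language model, which is where the savings over dense frame encoding would come from in a pipeline of this kind.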

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)