VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

Shihao Wang; Guo Chen; De-an Huang; Zhiqi Li; Minghan Li; Guilin Liu; Jose M. Alvarez; Lei Zhang; Zhiding Yu

arXiv:2507.13353 · cs.CV · March 18, 2026


TL;DR

VideoITG is an instruction-conditioned frame sampling framework for Video-LLMs: it selects frames according to the user's instruction, is trained on annotations generated by the automated VidThinker pipeline, and improves performance on video understanding benchmarks such as CG-Bench and MLVU.

Contribution

The paper presents VideoITG, a framework that adapts frame sampling to user instructions, together with VidThinker, an automated annotation pipeline, and the VideoITG-40K dataset for instructed temporal grounding.

Findings

01. VideoITG improves performance on multiple benchmarks.

02. VidThinker automates instruction-conditioned annotation.

03. The framework enhances temporal grounding accuracy.

Abstract

While Video Large Language Models (Video-LLMs) have shown significant potential in multimodal understanding and reasoning tasks, how to efficiently select the most informative frames from videos remains a critical challenge. Existing methods attempt to optimize frame sampling by reducing inter-frame redundancy or employing unsupervised event localization. However, these approaches often fall short in handling complex instruction-following tasks and scenarios that demand precise temporal modeling, resulting in limited performance in both semantic alignment and temporal reasoning. To address the above challenges, we introduce Instructed Temporal Grounding for Videos (VideoITG), a framework aiming to adaptively customize frame sampling strategies based on user instructions. Specifically, we design the VidThinker pipeline, which automates annotation by generating instruction-conditioned…
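To make the contrast between uniform and instruction-conditioned frame sampling concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the `threshold` parameter, and the fallback to uniform sampling when no frame is relevant are all illustrative assumptions. The per-frame relevance scores are assumed to come from some grounding model that has already scored each frame against the instruction.

```python
from typing import List, Sequence


def uniform_sample(num_frames: int, budget: int) -> List[int]:
    """Instruction-agnostic baseline: evenly spaced frame indices."""
    if budget <= 0 or num_frames <= 0:
        return []
    budget = min(budget, num_frames)
    step = num_frames / budget
    # Take the frame at the centre of each of the `budget` equal segments.
    return [int(step * i + step / 2) for i in range(budget)]


def instructed_sample(
    scores: Sequence[float], budget: int, threshold: float = 0.5
) -> List[int]:
    """Hypothetical instruction-conditioned selection.

    `scores[i]` is an assumed relevance score in [0, 1] for frame i,
    given the instruction. Keeps at most `budget` frames that pass
    `threshold`, returned in temporal order.
    """
    relevant = [i for i, s in enumerate(scores) if s >= threshold]
    if not relevant:
        # Assumption: with no relevant frame, fall back to uniform sampling.
        return uniform_sample(len(scores), budget)
    # Highest-scoring frames first, then restore temporal order.
    top = sorted(relevant, key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(top)
```

With scores `[0.1, 0.9, 0.8, 0.2, 0.7, 0.05]` and a budget of 2, `instructed_sample` keeps frames 1 and 2, whereas `uniform_sample(6, 2)` would pick frames 1 and 4 regardless of the instruction.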

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 3

Strengths

1. Architectural exploration is meaningful: Fig. 3 compares three alternatives under realistic constraints (text generation vs. discriminative classification, causal vs. full attention), and the ablation (Table 3) supports the preference for Variant-C. This is a useful design-space study for practitioners.
2. Data pipeline is systematic: VidThinker’s three-stage process (clip captioning, then retrieval, then frame-level filtering) gives a coherent, interpretable way to obtain instruction-aligned labels at

Weaknesses

1. **Task definition is confusing**: The paper toggles between “temporal grounding” and “temporal localization”, and positions VideoITG as “instruction-driven temporal grounding.” However, the formal definition and evaluation protocol are not crisply specified (what exactly is a correct grounding under multi-clue instructions? how do we measure frame-level correctness vs. downstream QA?). The intro text mentions general grounding vs. localization, but an unambiguous, task-level definition and me

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The following are the strengths of the paper:
**1. Addresses an important and relevant problem.** The paper targets the challenge of selecting informative frames from long videos, which is a key bottleneck in Video-MLLMs. The topic is timely and valuable for improving the scalability of Video-MLLMs.
**2. Reasonable annotation and training pipeline.** The proposed VidThinker pipeline, combining clip captioning, retrieval, and frame localization, forms a systematic way to create grounding data a

Weaknesses

The following are the weaknesses of the paper:
**1. Lack of clarity in instruction selection mechanism.** The paper does not explain how the system decides which instruction type (semantic, motion, both, or non-clues) applies to a given question-answer pair. It is unclear whether this is done by an LLM, heuristic rules, or predefined templates. Further, how does this decision differ from simply using the original question?
**2. Noise accumulation through multiple stages.** Each step of annotation

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper introduces Instructed Temporal Grounding, a novel task formulation that conditions frame selection on free-form user instructions rather than on fixed heuristics or single textual queries. This reframing turns frame sampling into a language-guided, task-adaptive decision problem.
2. The dataset construction is meticulous. VidThinker performs three cascaded checks (clip captioning, relevance retrieval, frame-level filtering) and integrates human-in-the-loop spot evaluations, yieldi

Weaknesses

1. The baseline model used in the article is too outdated; we would like to see results using InternVL3, InternVL3.5, or Qwen2.5-VL.
2. Limited instruction diversity and linguistic complexity. Although 500k QA pairs are generated, the prompt templates (Appendix C.1–C.10) reveal that most questions are single-sentence, factoid-style, and temporally local (“What did X do before Y?”). There are no instructions that require multi-hop reasoning across ≥3 disjoint moments, no anaphoric references (“Aft

Reviewer 04 · Rating 4 · Confidence 4

Strengths

1. The paper addresses a real and underexplored problem: how to select frames adaptively based on user instructions, which is more flexible and practical than static or uniform sampling.
2. VideoITG-40K is currently the largest instruction-guided temporal grounding dataset, and the automated annotation pipeline (VidThinker) is well-designed and interpretable.
3. The proposed method consistently improves performance across multiple benchmarks (e.g., +9.0% on CG-Bench, +8.6% on MLVU) when integrat

Weaknesses

1. While the task is novel, the model design (e.g., anchor tokens, pooling-based classification) appears to be incremental and shares similarities with existing vision-language modeling techniques. Although the framework effectively adapts current methods to the new task, it does not introduce significant architectural breakthroughs.
2. The paper primarily focuses on demonstrating the effectiveness of its dynamic frame selection approach but does not extensively compare it with some stronger fra

Code & Models

Models

nvidia/VideoITG-8B
model · 217 downloads · 7 likes

Datasets

NVEagle/VideoITG-40K
dataset · 85 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization