VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
Shihao Wang, Guo Chen, De-an Huang, Zhiqi Li, Minghan Li, Guilin Liu, Jose M. Alvarez, Lei Zhang, Zhiding Yu

TL;DR
VideoITG introduces an adaptive frame sampling framework that enhances multimodal video understanding by leveraging instruction-conditioned annotations and reasoning, significantly improving performance on various benchmarks.
Contribution
The paper presents VideoITG, a framework that adapts frame sampling to user instructions, along with the VidThinker annotation pipeline and the VideoITG-40K dataset, advancing temporal grounding in videos.
Findings
VideoITG improves performance on multiple benchmarks; one reviewer cites gains of +9.0% on CG-Bench and +8.6% on MLVU.
VidThinker automates instruction-conditioned annotation.
The framework enhances temporal grounding accuracy.
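The peer reviews below describe VidThinker as a three-stage cascade: clip captioning, relevance retrieval, and frame-level filtering. The following is a minimal sketch of that control flow only, not the authors' implementation. The function name `annotate`, the three callables, and the toy data are all hypothetical; in the actual pipeline each stage would presumably be backed by a model rather than the keyword checks used here. Passing the instruction to the captioner is likewise an assumption.

```python
from typing import Callable, List, Tuple

Clip = Tuple[int, int]  # (start_frame, end_frame), end exclusive


def annotate(
    clips: List[Clip],
    instruction: str,
    caption_clip: Callable[[Clip, str], str],
    is_relevant: Callable[[str, str], bool],
    keep_frame: Callable[[int, str], bool],
) -> List[int]:
    """Hypothetical cascade: caption clips, keep relevant clips, filter frames."""
    selected: List[int] = []
    for clip in clips:
        # Stage 1: describe the clip (assumed to be conditioned on the instruction).
        caption = caption_clip(clip, instruction)
        # Stage 2: drop clips whose caption does not match the instruction.
        if not is_relevant(caption, instruction):
            continue
        # Stage 3: keep only the frames inside the clip that pass the frame check.
        start, end = clip
        selected.extend(f for f in range(start, end) if keep_frame(f, instruction))
    return selected


# Toy stand-ins for the three stages (illustrative data, not from the paper).
toy_captions = {(0, 4): "a man opens the fridge", (4, 8): "a dog runs outside"}
toy_frames_with_dog = {5, 6}

labels = annotate(
    clips=[(0, 4), (4, 8)],
    instruction="What is the dog doing?",
    caption_clip=lambda clip, _q: toy_captions[clip],
    is_relevant=lambda cap, q: "dog" in cap and "dog" in q,
    keep_frame=lambda f, _q: f in toy_frames_with_dog,
)
print(labels)
```

Because each stage only narrows what the previous stage kept, an error at an early stage cannot be recovered later, which is the noise-accumulation concern one of the reviews raises.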
Abstract
While Video Large Language Models (Video-LLMs) have shown significant potential in multimodal understanding and reasoning tasks, how to efficiently select the most informative frames from videos remains a critical challenge. Existing methods attempt to optimize frame sampling by reducing inter-frame redundancy or employing unsupervised event localization. However, these approaches often fall short in handling complex instruction-following tasks and scenarios that demand precise temporal modeling, resulting in limited performance in both semantic alignment and temporal reasoning. To address the above challenges, we introduce Instructed Temporal Grounding for Videos (VideoITG), a framework aiming to adaptively customize frame sampling strategies based on user instructions. Specifically, we design the VidThinker pipeline, which automates annotation by generating instruction-conditioned…
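To make the contrast in the abstract concrete, the sketch below compares instruction-agnostic uniform sampling with selecting frames by per-frame relevance to an instruction. It assumes the relevance scores are already available (for example from a grounding model); the function names and the scores are hypothetical and only illustrate the idea, not the VideoITG model itself.

```python
from typing import List, Sequence


def uniform_sample(num_frames: int, budget: int) -> List[int]:
    """Pick `budget` evenly spaced frame indices, ignoring the instruction."""
    if budget >= num_frames:
        return list(range(num_frames))
    # Integer arithmetic picks the centre of each of `budget` equal segments.
    return [(2 * i + 1) * num_frames // (2 * budget) for i in range(budget)]


def instructed_sample(relevance: Sequence[float], budget: int) -> List[int]:
    """Pick the `budget` highest-scoring frames, returned in temporal order."""
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical per-frame relevance scores for a single instruction.
scores = [0.1, 0.0, 0.2, 0.9, 0.8, 0.1, 0.0, 0.7]

print(uniform_sample(len(scores), 3))   # [1, 4, 6]
print(instructed_sample(scores, 3))     # [3, 4, 7]
```

With the same budget of three frames, uniform sampling spends two of them on low-relevance frames, while the relevance-based selection concentrates on the frames that score highest for the instruction.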
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Architectural exploration is meaningful: Fig. 3 compares three alternatives under realistic constraints (text generation vs. discriminative classification, causal vs. full attention), and the ablation (Table 3) supports the preference for Variant-C. This is a useful design-space study for practitioners. 2. Data pipeline is systematic: VidThinker’s three-stage process (captioning, then retrieval, then frame-level filtering) gives a coherent, interpretable way to obtain instruction-aligned labels at
1. **Task definition is confusing**: The paper toggles between “temporal grounding” and “temporal localization”, and positions VideoITG as “instruction-driven temporal grounding.” However, the formal definition and evaluation protocol are not crisply specified (what exactly is a correct grounding under multi-clue instructions? how do we measure frame-level correctness vs. downstream QA?). The intro text mentions general grounding vs. localization, but an unambiguous, task-level definition and me
The following are the strengths of the paper: **1. Addresses an important and relevant problem.** The paper targets the challenge of selecting informative frames from long videos, which is a key bottleneck in Video-MLLMs. The topic is timely and valuable for improving the scalability of Video-MLLMs. **2. Reasonable annotation and training pipeline.** The proposed VidThinker pipeline, combining clip captioning, retrieval, and frame localization, forms a systematic way to create grounding data a
The following are the weaknesses of the paper: **1. Lack of clarity in instruction selection mechanism.** The paper does not explain how the system decides which instruction type (semantic, motion, both, or non-clues) applies to a given question-answer pair. It is unclear whether this is done by an LLM, heuristic rules, or predefined templates. Further, how does this decision differ from simply using the original question? **2. Noise accumulation through multiple stages.** Each step of annotation
1. The paper introduces Instructed Temporal Grounding, a novel task formulation that conditions frame selection on free-form user instructions rather than on fixed heuristics or single textual queries. This reframing turns frame sampling into a language-guided, task-adaptive decision problem. 2. The dataset construction is meticulous. VidThinker performs three cascaded checks (clip captioning, relevance retrieval, frame-level filtering) and integrates human-in-the-loop spot evaluations, yieldi
1. The baseline model used in the article is too outdated; we would like to see results using InternVL3, InternVL3.5, or Qwen2.5-VL. 2. Limited instruction diversity and linguistic complexity. Although 500k QA pairs are generated, the prompt templates (Appendix C.1–C.10) reveal that most questions are single-sentence, factoid-style, and temporally local (“What did X do before Y?”). There are no instructions that require multi-hop reasoning across ≥3 disjoint moments, no anaphoric references (“Aft
1. The paper addresses a real and underexplored problem: how to select frames adaptively based on user instructions, which is more flexible and practical than static or uniform sampling. 2. VideoITG-40K is currently the largest instruction-guided temporal grounding dataset, and the automated annotation pipeline (VidThinker) is well-designed and interpretable. 3. The proposed method consistently improves performance across multiple benchmarks (e.g., +9.0% on CG-Bench, +8.6% on MLVU) when integrat
1. While the task is novel, the model design (e.g., anchor tokens, pooling-based classification) appears to be incremental and shares similarities with existing vision-language modeling techniques. Although the framework effectively adapts current methods to the new task, it does not introduce significant architectural breakthroughs. 2. The paper primarily focuses on demonstrating the effectiveness of its dynamic frame selection approach but does not extensively compare it with some stronger fra
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
