UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding
Joungbin An, Agrim Jain, Kristen Grauman

TL;DR
UniversalVTG is a lightweight, unified model, trained across diverse datasets, that achieves state-of-the-art video temporal grounding performance and rivals much larger models at significantly lower compute cost.
Contribution
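The paper introduces UniversalVTG, a single, scalable VTG model with a novel offline Query Unifier that canonicalizes heterogeneous query formats. It outperforms dataset-specific models and matches or exceeds much larger multimodal language models in accuracy while using far less compute.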
Findings
UniversalVTG achieves state-of-the-art results across multiple benchmarks, including GoalStep-StepGrounding, Ego4D-NLQ, TACoS, and Charades-STA.
It is over 100 times smaller than recent multimodal language models.
UniversalVTG matches or exceeds the accuracy of larger models.
Abstract
Video temporal grounding (VTG) is typically tackled with dataset-specific models that transfer poorly across domains and query styles. Recent efforts to overcome this limitation have adapted large multimodal language models (MLLMs) to VTG, but their high compute cost and limited video context still hinder long-video grounding. We instead scale unified supervision while keeping the model lightweight. We present UniversalVTG, a single VTG model trained with large-scale cross-dataset pretraining. An offline Query Unifier canonicalizes heterogeneous query formats into a shared declarative space, reducing linguistic mismatch and preventing the negative transfer observed under naïve joint training. Combined with an efficient grounding head, UniversalVTG scales to long, untrimmed videos. Across diverse benchmarks, namely GoalStep-StepGrounding, Ego4D-NLQ, TACoS, Charades-STA, and…
