Unlocking the Potential of Grounding DINO in Videos: Parameter-Efficient Adaptation for Limited-Data Spatial-Temporal Localization
Zanyi Wang, Fan Li, Dengyang Jiang, Liuzhuozheng Li, Yunhua Zhong, Guang Dai, Mengmeng Wang

TL;DR
This paper introduces ST-GD, a parameter-efficient framework that adapts pre-trained visual-language models like Grounding DINO for video grounding tasks with limited data, achieving strong performance without extensive retraining.
Contribution
ST-GD strategically injects lightweight adapters into frozen pre-trained models and adds a temporal decoder, enabling effective spatio-temporal localization in small-data video grounding scenarios.
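The summary above does not say which adapter design ST-GD uses, so the following is only a minimal sketch of the general idea under stated assumptions: a common bottleneck adapter with a residual connection and a zero-initialized up-projection, attached to the output of a single frozen layer. The names `FrozenLayer`, `BottleneckAdapter`, `DIM`, and `BOTTLENECK` are hypothetical and do not come from the paper, and the temporal decoder is not modeled here. The sketch uses plain Python lists in place of a tensor library.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


class FrozenLayer:
    """Hypothetical stand-in for one pre-trained layer; its weights are never updated."""

    def __init__(self, dim, rng):
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    def __call__(self, x):
        return matvec(self.W, x)

    def num_params(self):
        return len(self.W) * len(self.W[0])


class BottleneckAdapter:
    """Assumed design: down-project, ReLU, up-project, then add back to the input.

    The up-projection starts at zero, so the adapter initially leaves the
    frozen layer's output unchanged.
    """

    def __init__(self, dim, bottleneck, rng):
        self.down = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(bottleneck)]
        self.up = [[0.0] * bottleneck for _ in range(dim)]  # zero init (assumption)

    def __call__(self, h):
        delta = matvec(self.up, relu(matvec(self.down, h)))
        return [a + b for a, b in zip(h, delta)]

    def num_params(self):
        # down is (bottleneck x dim), up is (dim x bottleneck)
        return 2 * len(self.down) * len(self.down[0])


rng = random.Random(0)
DIM, BOTTLENECK = 64, 4  # illustrative sizes, not taken from the paper

layer = FrozenLayer(DIM, rng)
adapter = BottleneckAdapter(DIM, BOTTLENECK, rng)

x = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
frozen_out = layer(x)
adapted_out = adapter(frozen_out)

# Only the adapter would be trained; the frozen layer's weights stay fixed.
trainable_fraction = adapter.num_params() / (adapter.num_params() + layer.num_params())
print(f"trainable fraction: {trainable_fraction:.3f}")
```

With these illustrative sizes the adapter holds 512 parameters against 4,096 frozen ones, and because the up-projection starts at zero, the adapted output initially equals the frozen output. That is one way an adapter can avoid disturbing pre-trained behavior at the start of fine-tuning; whether ST-GD initializes its adapters this way is not stated in the summary.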
Findings
ST-GD achieves competitive results on the HC-STVG v1 and v2 benchmarks.
It maintains robust generalization on the VidSTG dataset.
It counters data scarcity while training only a small number of parameters.
Abstract
Spatio-temporal video grounding (STVG) aims to localize queried objects within dynamic video segments. Prevailing fully-trained approaches are notoriously data-hungry. However, gathering large-scale STVG data is exceptionally challenging: dense frame-level bounding boxes and complex temporal language alignments are prohibitively expensive to annotate, especially for specialized video domains. Consequently, conventional models suffer from severe overfitting on these inherently limited datasets, while zero-shot foundational models lack the task-specific temporal awareness needed for precise localization. To resolve this small-data challenge, we introduce ST-GD, a data-efficient framework that adapts pre-trained 2D visual-language models (e.g., Grounding DINO) to video tasks. To avoid destroying pre-trained priors on small datasets, ST-GD keeps the base model frozen and strategically…
