Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
Xuezhen Tu, Jingyu Wu, Fangyu Kang, Qingpeng Nong, Kaijin Zhang, Chaoyue Niu, Fan Wu

TL;DR
This paper introduces Bridge-STG, a framework that decouples temporal and spatial localization in video grounding to address spatio-temporal entanglement and visual token redundancy, and reports state-of-the-art results.
Contribution
Bridge-STG is the first end-to-end model to decouple temporal and spatial localization with semantic bridging and query-guided modules, improving performance on video grounding tasks.
Findings
Achieves a state-of-the-art m_vIoU of 34.3 on the VidSTG benchmark.
Significantly improves cross-task transfer in fine-grained video understanding.
Effectively reduces visual token redundancy in spatio-temporal grounding.
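For readers unfamiliar with the m_vIoU metric cited above, the following is a minimal sketch of how it is commonly computed in spatio-temporal video grounding. It assumes the usual definition (per-frame box IoU summed over frames where prediction and ground truth overlap in time, divided by the number of frames in their temporal union, then averaged over samples). The listing does not confirm that the paper uses exactly this definition. The function names and the frame-index-to-box dictionary layout are illustrative choices, not taken from the paper.

```python
from typing import Dict, Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def v_iou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """vIoU for one sample: boxes are keyed by frame index.

    Frames in the temporal intersection contribute their box IoU;
    the sum is normalized by the size of the temporal union.
    """
    frames_union = set(pred) | set(gt)
    if not frames_union:
        return 0.0
    frames_inter = set(pred) & set(gt)
    total = sum(box_iou(pred[t], gt[t]) for t in frames_inter)
    return total / len(frames_union)


def mean_v_iou(samples: Iterable[Tuple[Dict[int, Box], Dict[int, Box]]]) -> float:
    """m_vIoU: average of per-sample vIoU over (pred, gt) pairs."""
    scores = [v_iou(pred, gt) for pred, gt in samples]
    return sum(scores) / len(scores) if scores else 0.0
```

The functions return a fraction in [0, 1]. Benchmark tables typically report the value multiplied by 100, which is presumably the scale of the 34.3 figure above.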
Abstract
Spatio-Temporal Video Grounding requires jointly localizing target objects across both temporal and spatial dimensions based on natural language queries, posing fundamental challenges for existing Multimodal Large Language Models (MLLMs). We identify two core challenges: entangled spatio-temporal alignment, arising from coupling two heterogeneous sub-tasks within the same autoregressive output space, and dual-domain visual token redundancy, where target objects exhibit simultaneous temporal and spatial sparsity, rendering the overwhelming majority of visual tokens irrelevant to the grounding query. To address these, we propose Bridge-STG, an end-to-end framework that decouples temporal and spatial localization while maintaining semantic coherence. While decoupling is the natural solution to this entanglement, it risks creating a semantic gap between the…
