Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

Xuezhen Tu; Jingyu Wu; Fangyu Kang; Qingpeng Nong; Kaijin Zhang; Chaoyue Niu; Fan Wu

arXiv:2604.08014·cs.CV·April 22, 2026

TL;DR

This paper introduces Bridge-STG, a framework that decouples temporal and spatial localization in spatio-temporal video grounding. It targets two problems, entangled spatio-temporal alignment and visual token redundancy, and reports state-of-the-art results on the VidSTG benchmark.

Contribution

Bridge-STG is presented as the first end-to-end model to decouple temporal and spatial localization, using semantic bridging and query-guided modules to keep the two stages coherent and improve performance on video grounding tasks.

Findings

01. Achieves a state-of-the-art m_vIoU of 34.3 on the VidSTG benchmark.

02. Significantly improves cross-task transfer in fine-grained video understanding.

03. Reduces visual token redundancy in spatio-temporal grounding.
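The m_vIoU figure in the first finding is easier to interpret with the metric written out. The sketch below assumes the definition commonly used for VidSTG-style evaluation: for each sample, per-frame box IoU is summed over the frames where prediction and ground truth overlap in time, then divided by the number of frames in the union of the two temporal spans; m_vIoU is the mean over samples. The function names and the frame-indexed dictionary layout are illustrative assumptions, not taken from the paper.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def v_iou(pred, gt):
    """vIoU for one sample.

    pred, gt: dicts mapping frame index -> box (hypothetical layout).
    Frames present in only one of the two count toward the denominator
    but contribute nothing to the numerator.
    """
    union_frames = set(pred) | set(gt)
    if not union_frames:
        return 0.0
    shared_frames = set(pred) & set(gt)
    total = sum(box_iou(pred[t], gt[t]) for t in shared_frames)
    return total / len(union_frames)


def m_v_iou(samples):
    """Mean vIoU over a list of (pred, gt) pairs."""
    return sum(v_iou(p, g) for p, g in samples) / len(samples)


# Prediction covers frames 0-1, ground truth covers frames 1-2,
# with a perfect box match on the single shared frame.
pred = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10)}
gt = {1: (0, 0, 10, 10), 2: (0, 0, 10, 10)}
score = v_iou(pred, gt)
```

Under this definition a prediction is penalized for both temporal and spatial error: in the example the boxes match exactly on the shared frame, yet the score is only one third because two of the three frames in the union are missed. A reported value such as 34.3 is presumably this mean expressed as a percentage.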

Abstract

Spatio-Temporal Video Grounding requires jointly localizing target objects across both temporal and spatial dimensions based on natural language queries, posing fundamental challenges for existing Multimodal Large Language Models (MLLMs). We identify two core challenges: entangled spatio-temporal alignment, arising from coupling two heterogeneous sub-tasks within the same autoregressive output space, and dual-domain visual token redundancy, where target objects exhibit simultaneous temporal and spatial sparsity, rendering the overwhelming majority of visual tokens irrelevant to the grounding query. To address these, we propose Bridge-STG, an end-to-end framework that decouples temporal and spatial localization while maintaining semantic coherence. While decoupling is the natural solution to this entanglement, it risks creating a semantic gap between the…
