Towards Long-Form Spatio-Temporal Video Grounding
Xin Gu, Bing Fan, Jiali Yao, Zhipeng Zhang, Yan Huang, Cheng Han, Heng Fan, Libo Zhang

TL;DR
This paper introduces ART-STVG, an autoregressive transformer for long-form spatio-temporal video grounding. It processes frames sequentially and relies on memory mechanisms to handle lengthy videos, and it outperforms existing methods designed for short videos.
Contribution
The paper presents a novel autoregressive transformer architecture with memory banks and cascaded decoding for efficient long-form video grounding, addressing the limitations of existing methods that focus on short videos.
Findings
ART-STVG outperforms state-of-the-art methods (TubeDETR, STCAT, CG-STVG, and TA-STVG) on long-form validation sets of 1 to 5 minutes extended from HCSTVG-v2.
The model achieves competitive results on short-form STVG.
Memory selection strategies enhance decoding relevance and performance.
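As an illustration of the memory selection idea, the following is a minimal Python sketch, not the authors' implementation. It follows the description given in the peer reviews further down: spatial selection compares the text query with memory slots, and temporal selection segments the memory by comparing adjacent slots and keeps the most recent segment. The cosine similarity, the `top_k` and `boundary_thresh` parameters, and all function names are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_spatial(memory, text_vec, top_k=2):
    """Assumed rule: keep the top_k memory slots most similar to the text query."""
    ranked = sorted(range(len(memory)),
                    key=lambda i: cosine(memory[i], text_vec),
                    reverse=True)
    return sorted(ranked[:top_k])  # return slot indices in temporal order

def select_temporal(memory, boundary_thresh=0.5):
    """Assumed rule: cut the memory wherever adjacent slots are dissimilar,
    then return the indices of the most recent segment."""
    start = 0
    for i in range(1, len(memory)):
        if cosine(memory[i - 1], memory[i]) < boundary_thresh:
            start = i  # a new segment begins at slot i
    return list(range(start, len(memory)))

# Toy memory bank: two slots from one scene followed by two from another.
memory = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
text_query = [0.0, 1.0]

print(select_spatial(memory, text_query))
print(select_temporal(memory))
```

In this toy setup both strategies pick the last two slots: they match the query best, and they form the most recent segment after the scene change.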
Abstract
In real scenarios, videos can span several minutes or even hours. However, existing research on spatio-temporal video grounding (STVG), which localizes a target given a textual query, mainly focuses on short videos of tens of seconds, typically less than one minute, which limits real-world applications. In this paper, we explore Long-Form STVG (LF-STVG), which aims to locate targets in long-term videos. Compared with short videos, long-term videos contain much longer temporal spans and more irrelevant information, making them difficult for existing STVG methods that process all frames at once. To address this challenge, we propose an AutoRegressive Transformer architecture for LF-STVG, termed ART-STVG. Unlike conventional STVG methods that require the entire video sequence to make predictions at once, ART-STVG treats the video as streaming input and processes frames sequentially, enabling…
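The streaming idea in the abstract can be sketched as follows. This is a hedged toy example, not the paper's model: `ground_streaming`, the `predict` callback, and `memory_size` are hypothetical names, the per-frame predictor is a stand-in threshold rule, and taking the first and last positive frame as the temporal span is a simplification.

```python
from collections import deque

def ground_streaming(frames, predict, memory_size=8):
    """Process frames one at a time, keeping only a bounded memory.

    frames  : any iterable of per-frame features (may be a generator).
    predict : hypothetical callback (frame, memory) -> (box, inside_event).
    Returns ((start, end), {frame_index: box}), or (None, {}) if nothing is found.
    """
    memory = deque(maxlen=memory_size)  # oldest slots are dropped automatically
    boxes = {}
    for t, frame in enumerate(frames):
        box, inside = predict(frame, list(memory))
        if inside:
            boxes[t] = box
        memory.append(frame)  # the current frame becomes context for later ones
    if not boxes:
        return None, {}
    return (min(boxes), max(boxes)), boxes

def toy_predict(frame, memory):
    """Stand-in predictor: a frame is 'inside' if its relevance score is high."""
    return (0.0, 0.0, 1.0, 1.0), frame > 0.5

def frame_source():
    """Generator standing in for a video stream of per-frame relevance scores."""
    yield from [0.1, 0.2, 0.9, 0.8, 0.9, 0.1]

span, tube = ground_streaming(frame_source(), toy_predict, memory_size=2)
print(span, sorted(tube))
```

Because the loop holds at most `memory_size` past frames, its working set does not grow with video length, which is the property that distinguishes it from processing all frames at once.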
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. The motivation for the proposed task and the model design is intuitively reasonable and technically sound. 2. LF-STVG is a practical and promising new direction that opens up new opportunities in localization-oriented multimodal video understanding. 3. Extensive experiments verify the drawbacks of previous SF-STVG methods in long-video scenarios, and the proposed method provides a strong baseline for long-form STVG.
1. One of my biggest concerns regarding this work is the lack of tailored training data for LF-STVG. As mentioned in the manuscript, the authors extended the validation set of a short-form STVG dataset, HCSTVG-v2, to several minutes, and the experimental results are reported for models trained on existing short-form STVG data. This limitation decreases the contribution of this work, as I think LF-STVG is a very challenging task and needs tailored training data to facilitate its development in th
1. This paper introduces ART-STVG for long-form spatio-temporal video grounding (LF-STVG), using frame-wise autoregressive decoding to mitigate the memory burden and irrelevant distractions of parallel full-clip processing. 2. It proposes novel spatio-temporal memory selection strategies: spatially, it selects context by comparing the text query with memory slots; temporally, it segments by comparing adjacent memory slots and retrieves the most recent segment, an effective design for long-video s
1. All experiments are conducted on the extended HCSTVG-v2 long-video validation set, with no evaluation on other grounding benchmarks such as VidSTG [A] (short/long). The model’s generalization therefore remains to be verified. 2. The structural novelty is moderate. The adopted mechanisms (multimodal feature extraction and fusion, selective memory, and memory banks) have been widely used in video QA, temporal localization, and long-video modeling. The paper’s main contribution lies in ta
1. The work is well motivated: it is the first to explore the LF-STVG problem and to propose a framework for handling it. 2. It achieves SOTA results on extended datasets for LF-STVG while maintaining competitive results on SF-STVG. 3. The writing is clear, and the presentation of figures and diagrams is good.
1. Some ablation studies are missing. For example, what is the impact of the number of selective temporal memories? What is the impact of the order of the spatial decoder and the temporal decoder? 2. More detailed discussion should be added. Since ART-STVG trades time for space by ingesting frames one at a time, a detailed discussion of the trade-off between space and time would help verify the balance in practical scenes. 3. More explanation should be given. Generally, longer videos bring ri
- This paper introduces a new task setting, i.e., long-form spatio-temporal video grounding (LF-STVG). - The authors also extend HCSTVG-v2 to 1–5-minute validation and report large gains over TubeDETR, STCAT, CG-STVG, and TA-STVG on these long-form sets. - This paper is easy to understand.
- Although the paper frames LF-STVG as long-form, the maximum evaluated duration (~5 minutes) is modest when compared to datasets like MAD, where the average video spans ~110.8 minutes [a]. Moreover, OmniSTVG [b] investigates long-form conditions and multi-object detection, thereby encompassing a more comprehensive setting. - ART-STVG seems to assemble known techniques (self-attention fusion, memory banks, selection mechanisms), and the specific novelty remains unclear. - This paper claims that
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
