STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning
Xiaowen Zhang, Zhi Gao, Licheng Jiao, Lingling Li, Qing Li

TL;DR
This paper introduces STVG-R1, a reinforcement learning framework for dense video grounding that uses visual prompts with unique object IDs to improve accuracy and reduce annotation costs, achieving state-of-the-art results.
Contribution
It proposes a novel visual prompting method with instance-level IDs and a reinforcement learning approach for spatial-temporal grounding, surpassing existing methods in accuracy and generalization.
Findings
20.9% improvement in m_IoU on the HCSTVG-v2 benchmark
State-of-the-art 47.3% J&F on MeViS zero-shot segmentation
Effective joint optimization of temporal, spatial, and structural aspects
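One way to picture a reward that jointly covers the temporal, spatial, and structural aspects is sketched below. This is only an illustration under stated assumptions, not the paper's reward: the `<answer>ID=..., start-end</answer>` output format, the function names, and the weights `w_t`, `w_s`, `w_f` are all hypothetical. The structural term checks that the response parses, the spatial term checks that the predicted object ID matches the target, and the temporal term is the IoU of the predicted and ground-truth time spans.

```python
import re

# Hypothetical answer format, e.g. "<answer>ID=3, 2.0-5.5</answer>".
# The real output format used by STVG-R1 is not given in this summary.
ANSWER_RE = re.compile(
    r"<answer>\s*ID=(\d+),\s*(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)\s*</answer>"
)


def temporal_iou(pred, gt):
    """IoU of two time spans given as (start, end) in seconds."""
    if pred[1] <= pred[0] or gt[1] <= gt[0]:
        return 0.0  # degenerate or reversed span
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union


def grounding_reward(response, gt_id, gt_span, w_t=1.0, w_s=1.0, w_f=0.5):
    """Weighted sum of format, ID-match, and temporal-IoU terms.

    The weights are illustrative assumptions, not values from the paper.
    """
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0  # unparseable output earns no reward at all
    pred_id = int(match.group(1))
    pred_span = (float(match.group(2)), float(match.group(3)))
    format_term = 1.0                              # structural: output parsed
    spatial_term = 1.0 if pred_id == gt_id else 0.0  # spatial: right instance
    temporal_term = temporal_iou(pred_span, gt_span)  # temporal: span overlap
    return w_f * format_term + w_s * spatial_term + w_t * temporal_term
```

Because the spatial term reduces to an ID comparison, the reward never has to score per-frame box coordinates produced by the model, which is the point of the instance-level reformulation.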
Abstract
In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Prior approaches typically focus on enhancing visual-textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per-frame coordinate prediction as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable…
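The reformulation described in the abstract can be illustrated with a small sketch: if every object already carries a temporally consistent ID with tracked boxes, the model only needs to name an ID and a frame span, and the per-frame boxes are looked up rather than predicted. The data layout (`Tracks`), the function names, and the vIoU-style score below are assumptions made for illustration; the score follows the definition commonly used in STVG evaluation (sum of per-frame box IoU over shared frames, divided by the number of frames in the union), which may differ in detail from the metric reported in the paper.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Tracks = Dict[int, Dict[int, Box]]        # object ID -> {frame index -> box}


def id_to_tube(tracks: Tracks, obj_id: int, start: int, end: int) -> Dict[int, Box]:
    """Recover a spatio-temporal tube from a predicted ID and frame span."""
    boxes = tracks.get(obj_id, {})  # unknown ID yields an empty tube
    return {f: b for f, b in boxes.items() if start <= f <= end}


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def v_iou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """Box IoU summed over shared frames, normalized by the frame union."""
    union_frames = set(pred) | set(gt)
    if not union_frames:
        return 0.0
    shared = set(pred) & set(gt)
    return sum(box_iou(pred[f], gt[f]) for f in shared) / len(union_frames)
```

For example, with a track `{1: {0: box, 1: box, 2: box}}`, calling `id_to_tube(tracks, 1, 1, 2)` returns the boxes for frames 1 and 2 only; the quality of the tube then depends on the tracker and on the predicted span, not on coordinates generated by the VLM.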
Peer Reviews
Decision: ICLR 2026 Poster
• Novelty: Reformulates STVG from dense per-frame coordinate prediction into a "compact instance-level identification task", a novel idea that avoids the difficult problem of VLMs handling coordinate prediction.
• Novel RL Framework: Proposes STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to optimize the VLM's reasoning.
• State-of-the-art Results: Achieves new SOTA performance on multiple STVG benchmarks.
• Strong Generalization
**Regarding the nature and robustness of the "visual prompting" pipeline:**
• The pipeline is essentially a complex, training-free data pre-processing pipeline reliant on external SOTA models such as YOLO, SAM2, and ReID, rather than a novel model component.
• The robustness of this pipeline is not discussed. For example, what happens when detection, tracking, or ReID fail? Many critical details of the pipeline, such as the arbitration logic between components, are missing, which hinders repro
Pros:
1. The paper poses a nice application of combining spatio-temporal understanding with VLMs. Spatio-temporal understanding is an important sub-topic in video understanding, and how best to leverage VLMs for this task is a well-motivated problem.
2. The proposed method is conceptually simple: it reuses existing detection and tracking pipelines to exploit VLMs' inherent understanding of vision, without additional tokens and without requiring the VLM to predict bounding boxes.
3. Authors
Cons:
1. The absolute improvement over previous baselines seems marginal. For instance, on HCSTVG-v1 performance matches SpaceVLLM, and on v2 it is only slightly better than TA-STVG. On ST-Align, it is the same as LLaVA-ST-7B.
2. The core novelty is somewhat limited: the paper shows that visual prompting plus GRPO training works. This is good to know, but it is unclear what the main challenges are.
3. One issue with the visual prompting (taking the visualization at face value), it is
1. The writing of the paper is clear, and the illustrations in the introduction effectively explain its core contributions. Figures 2 and 3 are also quite intuitive.
2. The problem studied in this paper (cross-modal alignment) is one of the core issues in video grounding, and the quality of alignment across the spatial and temporal dimensions directly determines grounding accuracy in both.
3. The experiment used fou
1. The last sentence of the second paragraph of the introduction should describe the core idea, but it is not clear. What does "a compact and interpretable formulation" refer to?
2. Insufficient contribution in model design: the paper only introduces pre-segmentation and reinforcement learning for video objects, and lacks a new method design for this problem.
3. The commonly used datasets in the field of video grounding include VidSTG and HC-STVG. This article only uses HC-STVG, and the effe
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Visual Attention and Saliency Detection
