Temporal Grounding as a Learning Signal for Referring Video Object Segmentation
Seunghun Lee, Jiwan Seo, Jeonghoon Kim, Sungho Moon, Siwon Kim, Haeun Yun, Hyogyeong Jeon, Wonhyeok Choi, Jaehoon Jeong, Zane Durante, Sang Hyun Park, Sunghoon Im

TL;DR
This paper introduces a novel temporal grounding framework for referring video object segmentation, utilizing explicit temporal annotations to improve semantic alignment and achieve state-of-the-art results.
Contribution
It proposes Temporally Grounded Learning (TGL), incorporating temporal supervision via new strategies like MDP and OSS, addressing semantic misalignment issues in RVOS.
Findings
Achieves a new state of the art on the MeViS benchmark.
Effectively leverages temporal annotations for improved segmentation.
Reduces semantic noise during training.
Abstract
Referring Video Object Segmentation (RVOS) aims to segment and track objects in videos based on natural language expressions, requiring precise alignment between visual content and textual queries. However, existing methods often suffer from semantic misalignment, largely due to indiscriminate frame sampling and supervision of all visible objects during training -- regardless of their actual relevance to the expression. We identify the core problem as the absence of an explicit temporal learning signal in conventional training paradigms. To address this, we introduce MeViS-M, a dataset built upon the challenging MeViS benchmark, where we manually annotate temporal spans when each object is referred to by the expression. These annotations provide a direct, semantically grounded supervision signal that was previously missing. To leverage this signal, we propose Temporally Grounded…
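The contrast the abstract draws between indiscriminate frame sampling and sampling guided by annotated temporal spans can be illustrated with a minimal sketch. This is not the paper's TGL implementation. The span format (inclusive start and end frame indices) and the function names `frames_in_spans`, `sample_grounded_frames`, and `sample_uniform_frames` are assumptions made purely for illustration.

```python
import random
from typing import List, Tuple

# Hypothetical annotation format: inclusive (start_frame, end_frame) pairs
# marking when the expression actually describes the object.
Span = Tuple[int, int]


def frames_in_spans(spans: List[Span], num_frames: int) -> List[int]:
    """Return the sorted frame indices covered by any annotated span."""
    covered = set()
    for start, end in spans:
        # Clip each span to the valid frame range of the video.
        lo, hi = max(0, start), min(num_frames - 1, end)
        covered.update(range(lo, hi + 1))
    return sorted(covered)


def sample_grounded_frames(
    spans: List[Span], num_frames: int, k: int, rng: random.Random
) -> List[int]:
    """Sample up to k frames only from frames where the expression holds."""
    candidates = frames_in_spans(spans, num_frames)
    if not candidates:
        return []
    return sorted(rng.sample(candidates, min(k, len(candidates))))


def sample_uniform_frames(num_frames: int, k: int, rng: random.Random) -> List[int]:
    """Baseline: sample frames from the whole video, ignoring the spans."""
    return sorted(rng.sample(range(num_frames), min(k, num_frames)))


if __name__ == "__main__":
    rng = random.Random(0)
    spans = [(10, 14), (30, 32)]  # e.g. the cat is jumping only in these frames
    print("grounded:", sample_grounded_frames(spans, 40, 4, rng))
    print("uniform: ", sample_uniform_frames(40, 4, rng))
```

Under these assumptions, the grounded sampler can only return frames inside the annotated spans, whereas the uniform baseline may return frames in which the object is visible but the expression does not apply.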
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The proposed approach outperforms previous approaches across multiple datasets, metrics, and settings.
- Architecture Novelty: The novelty looks limited; the focus is more on filtering the dataset than on developing or modifying the architecture.
- Section 4.2: Design Motivation: For irrelevant frames, the approach uses direct raw predictions from SAM. Wouldn’t SAM output masks for all the objects present in the scene? How would the approach filter the irrelevant masks?
- Section 4.3: Motion-Guided Propagation: Why are M- frames passed at train time if they are not utilized at all? I didn’t get this
1. The authors introduce a new dataset MeViS-M with additional temporal annotations.
2. The proposed TGL framework achieves state-of-the-art performance on the challenging MeViS benchmark.
1. The paper presents a new approach, TGL. However, it mainly introduces additional supervisory signals into the dataset. The MDP and OSS modules in TGL seem to perform selective memory computation or mask filtering based on these auxiliary signals, which may somewhat limit the overall novelty.
2. The proposed method appears to rely mainly on the frame-level annotations provided by the temporal grounding (i.e., whether the target appears in the current frame), which may be somewhat similar to p
* **Valuable Dataset Contribution**: MeViS-M provides carefully curated temporal annotations with corrections to the original MeViS (fixing ID errors, adding missing objects, correcting masks as shown in Figure 4). This refined benchmark will benefit the research community.
* **Well-Articulated Problem**: The paper clearly identifies a fundamental issue: training models to segment "jumping cats" while supervising frames where cats sit still creates contradictory learning signals. This motivation
* **Unfair Experimental Comparisons**: All baseline methods were trained on original MeViS while TGL uses MeViS-M with corrected annotations and temporal labels. This confounds two factors: (1) cleaner training data and (2) novel methodology. To fairly assess the contribution of MDP and OSS, baselines (especially SAMWISE) should also be trained on MeViS-M with moment-aware sampling. Additionally, Table 1 shows SAMWISE with VLMs (†) performs worse than without (48.9 vs 49.5), contradicting the pr
- The paper is clearly written and well-structured; the figures and tables aid comprehension, and the methodology is logically sound and well presented.
- The proposed temporal grounding strategy effectively achieves leading performance on the newly introduced MeViS-M benchmark.
- While the authors argue that using purely visual features (FSAM) in M- can mitigate semantic contamination, textual features in M+ are still required for memory queries during inference. Does this inconsistency across feature spaces induce the accumulation of cross-modal propagation errors?
- OSS assumes that each target object in a video corresponds to a specific time interval described by language. However, in real-world scenarios, linguistic expressions are often ambiguous, polysemous, or s
1. The paper is well-written and clearly structured, making the proposed methodology and contributions easy to understand.
2. The ablation studies validate the effectiveness of the individual components (MDP and OSS) within the proposed TGL framework.
1. The novelty of the proposed method appears somewhat limited. The most significant contribution seems to be the manually annotated MeViS-M dataset. While valuable, this approach is labor-intensive and may not be easily scalable. Furthermore, the testing strategy of directly using VLMs to select top-k frames is relatively straightforward and may lack sophistication.
2. Table 1 would be strengthened by including comparisons with other methods (such as SAMWISE and GLUS) under the setting of 'usi
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
