Invert4TVG: A Temporal Video Grounding Framework with Inversion Tasks Preserving Action Understanding Ability
Zhaoyu Chen, Hongnan Lin, Yongwei Nie, Fei Ma, Xuemiao Xu, Fei Yu, Chengjiang Long

TL;DR
This paper introduces Invert4TVG, a novel framework for temporal video grounding that incorporates inversion tasks to enhance action understanding, leading to improved grounding accuracy on benchmark datasets.
Contribution
The paper proposes integrating inversion-based auxiliary tasks into TVG models to better preserve action understanding during training.
Findings
Achieves a 7.1-point improvement in R1@0.7 over Time-R1 on Charades-STA
Outperforms state-of-the-art methods
Demonstrates effectiveness of inversion tasks in preserving action understanding
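The R1@m numbers above are recall figures thresholded on temporal IoU. As a rough illustration of how such a metric is usually computed (a minimal sketch, not the authors' evaluation code; the function names and the `(start, end)` segment layout in seconds are assumptions made here for illustration):

```python
from typing import List, Tuple

# Assumed layout: (start_sec, end_sec) with start <= end.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-Union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Segment], gts: List[Segment], threshold: float) -> float:
    """Fraction of queries whose top-1 predicted segment reaches IoU >= threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    preds = [(0.0, 10.0), (2.0, 8.0)]
    gts = [(5.0, 15.0), (2.0, 8.0)]
    print(recall_at_1(preds, gts, threshold=0.7))
```

Under this definition a prediction can score a high IoU while still mislabeling the action, which is the failure mode the paper targets.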
Abstract
Temporal Video Grounding (TVG) aims to localize video segments corresponding to a given textual query, which often describes human actions. However, we observe that current methods, usually optimizing for high temporal Intersection-over-Union (IoU), frequently struggle to accurately recognize or understand the underlying actions in both the video and query, thus reducing the effectiveness of these methods. To address this, we propose a novel TVG framework that integrates inversion-based TVG as auxiliary objectives to maintain the model's action understanding ability. We introduce three kinds of inversion TVG tasks derived from the original TVG annotations: (1) Verb Completion, predicting masked verbs (actions) in queries given video segments; (2) Action Recognition, identifying query-described actions; and (3) Video Description, generating descriptions containing query-relevant actions…
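To make the inversion idea concrete, the sketch below derives a Verb Completion sample from an ordinary TVG annotation (a query plus its ground-truth segment) by masking the verb. It is a simplified illustration under stated assumptions, not the paper's pipeline: the `VerbCompletionSample` type, the `[MASK]` token, and the tiny `TOY_VERBS` lookup are all hypothetical, and the lookup merely stands in for a real part-of-speech tagger.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass
class VerbCompletionSample:
    segment: Tuple[float, float]  # ground-truth (start_sec, end_sec) shown to the model
    masked_query: str             # query with the verb replaced by the mask token
    answer: str                   # the verb the model is expected to predict


# Hypothetical toy vocabulary; a real system would detect verbs with a POS tagger.
TOY_VERBS: FrozenSet[str] = frozenset({"opens", "closes", "buttons", "unbuttons"})


def make_verb_completion(
    query: str,
    segment: Tuple[float, float],
    verbs: FrozenSet[str] = TOY_VERBS,
    mask: str = "[MASK]",
) -> Optional[VerbCompletionSample]:
    """Mask the first known verb in the query; return None if no verb is found."""
    for match in re.finditer(r"[A-Za-z]+", query):
        word = match.group().lower()
        if word in verbs:
            masked = query[: match.start()] + mask + query[match.end():]
            return VerbCompletionSample(segment, masked, word)
    return None


if __name__ == "__main__":
    print(make_verb_completion("A person opens the door.", (3.0, 9.5)))
```

The point of the construction is that no new labels are needed: the same (video, query, segment) triple serves both the original grounding task and its inverted counterpart.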
Peer Reviews
Decision: ICLR 2026 Poster
1) **Clear diagnosis and motivation.** The paper convincingly shows that **IoU-centric optimization can erode action understanding**, with Figure 1 and accompanying evidence making the point concrete. The “buttoning vs. unbuttoning” example effectively illustrates the failure mode. 2) **Strong empirical gains.** The method reaches **44.0% R1@0.7 on Charades-STA**, a **+7.1** point improvement over Time-R1, and delivers **consistent improvements across Charades-STA, ActivityNet, and QVHighli
1) **Incremental novelty via training reformulation.** The main ingredients—GRPO-based RL, inversion-style auxiliary tasks, and template/format rewards—are adapted from existing ideas. The contribution lies primarily in **recasting TVG training** rather than introducing a fundamentally new algorithmic primitive. 2) **Verb-centric semantics may be brittle.** Reliance on **SpaCy verb lemmatization** for VC/AR/VD risks overlooking **non-verbal cues** (objects, states) and **nuanced modifiers
1. The paper introduces the unique "Inversion TVG Tasks" combined with a dynamically probabilistic reinforcement learning framework, cleverly leveraging existing data for self-supervised action understanding, providing a novel problem-solving and execution strategy for the TVG field. 2. The methodology is rigorously designed, especially in the three multi-granularity inversion tasks and their reward functions. Experiments are comprehensive, yielding significantly superior SOTA performance across m
1. The paper’s central claim is that "by reversing the task, the model’s action-understanding ability is preserved and even enhanced," and this improvement is presented as the reason for the superior Temporal Video Grounding (TVG) performance. However, throughout the experimental section the authors only report higher TVG localization metrics (R1@m). They never directly evaluate the final Invert4TVG model on the action-understanding tasks (i.e., the three reversed tasks: VC, AR, and VD) on the te
+ The idea of reversing the TVG process to construct self-supervised auxiliary objectives is conceptually fresh and well-motivated. + The method is compatible with large LVLMs and scalable to different model sizes (3B and 7B). + The paper is written in a clear style, making it easy for readers to grasp the core points.
+ The authors claim that “existing TVGs over-optimize IoU, leading to semantic degradation”, but the paper lacks an experimental analysis of IoU improvement. + Do the datasets used for TVG and Invert-TVG overlap? Is Invert-TVG more like a video QA task? + The performance improvement over Time-R1 is limited. The main paper has no ablation experiments for the parameter p=0.8.
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
