HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos
Tingting Han, Xinsong Tao, Yufei Yin, Min Tan, Sicheng Zhao, Zhou Yu

TL;DR
This paper introduces the OV-TSGV task and benchmarks, and proposes HERO, a hierarchical embedding-refinement framework that improves open-vocabulary temporal sentence grounding in videos, demonstrating superior generalization over existing methods.
Contribution
The paper defines the new OV-TSGV task, creates benchmarks, and proposes HERO, a novel hierarchical embedding-refinement model for improved open-vocabulary video grounding.
Findings
HERO outperforms state-of-the-art methods on OV-TSGV benchmarks.
HERO demonstrates strong generalization to unseen vocabulary.
The benchmarks facilitate systematic evaluation of open-vocabulary grounding.
Abstract
Temporal Sentence Grounding in Videos (TSGV) aims to temporally localize the segments of a video that correspond to a given natural language query. Despite recent progress, most existing TSGV approaches operate under closed-vocabulary settings, which limits their ability to generalize to real-world queries involving novel or diverse linguistic expressions. To bridge this gap, we introduce the Open-Vocabulary TSGV (OV-TSGV) task and construct the first dedicated benchmarks, Charades-OV and ActivityNet-OV, which simulate realistic vocabulary shifts and paraphrastic variations. These benchmarks facilitate systematic evaluation of model generalization beyond seen training concepts. To tackle OV-TSGV, we propose HERO (Hierarchical Embedding-Refinement for Open-Vocabulary grounding), a unified framework that leverages hierarchical linguistic embeddings and performs parallel cross-modal…
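To make the task definition concrete: a TSGV model maps a (video, query) pair to a time segment, and predictions are commonly scored by temporal intersection-over-union (IoU) against the annotated segment. The excerpt above does not state which metric the OV-TSGV benchmarks use, so the following is only a minimal sketch of that common convention. The names `Segment`, `temporal_iou`, and `recall_at_iou` and the 0.5 threshold are illustrative assumptions, not taken from the paper.

```python
from typing import List, Tuple

# A segment is (start_sec, end_sec); this representation is an assumption.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    # Overlap is zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: List[Segment], gts: List[Segment],
                  threshold: float = 0.5) -> float:
    """Fraction of queries whose top prediction reaches the IoU threshold."""
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    # Hypothetical predictions and annotations for two queries.
    preds = [(0.0, 10.0), (0.0, 5.0)]
    gts = [(0.0, 10.0), (6.0, 10.0)]
    print(recall_at_iou(preds, gts))
```

In an open-vocabulary setting, the same score would simply be computed on test queries that contain words or phrasings absent from training, which is how a drop in generalization would show up.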
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
