HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos

Tingting Han; Xinsong Tao; Yufei Yin; Min Tan; Sicheng Zhao; Zhou Yu

arXiv:2603.06732 · cs.CV · March 10, 2026

Open Access

TL;DR

This paper introduces the OV-TSGV task and benchmarks, and proposes HERO, a hierarchical embedding-refinement framework that improves open-vocabulary temporal sentence grounding in videos, demonstrating superior generalization over existing methods.

Contribution

The paper defines the new OV-TSGV task, creates benchmarks, and proposes HERO, a novel hierarchical embedding-refinement model for improved open-vocabulary video grounding.

Findings

01. HERO outperforms state-of-the-art methods on OV-TSGV benchmarks.

02. HERO demonstrates strong generalization to unseen vocabulary.

03. The benchmarks facilitate systematic evaluation of open-vocabulary grounding.

Abstract

Temporal Sentence Grounding in Videos (TSGV) aims to temporally localize segments of a video that correspond to a given natural language query. Despite recent progress, most existing TSGV approaches operate under closed-vocabulary settings, limiting their ability to generalize to real-world queries involving novel or diverse linguistic expressions. To bridge this critical gap, we introduce the Open-Vocabulary TSGV (OV-TSGV) task and construct the first dedicated benchmarks, Charades-OV and ActivityNet-OV, that simulate realistic vocabulary shifts and paraphrastic variations. These benchmarks facilitate systematic evaluation of model generalization beyond seen training concepts. To tackle OV-TSGV, we propose HERO (Hierarchical Embedding-Refinement for Open-Vocabulary grounding), a unified framework that leverages hierarchical linguistic embeddings and performs parallel cross-modal…
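To make the task concrete: a TSGV model outputs a (start, end) segment for each query, and grounding work commonly scores such a segment by its temporal intersection-over-union (IoU) with the annotated segment. The sketch below illustrates that general scoring idea only. It is not taken from the paper; the function names, the 0.5 threshold, and the example timestamps are assumptions chosen for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical type alias


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Temporal IoU between a predicted and an annotated segment."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the two segments do not intersect.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: List[Segment], gts: List[Segment],
                  threshold: float = 0.5) -> float:
    """Fraction of queries whose top prediction reaches the IoU threshold.

    The 0.5 default is an illustrative assumption, not a value from the paper.
    """
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    # Made-up timestamps for two queries on one video.
    predictions = [(2.0, 8.0), (0.0, 1.0)]
    annotations = [(4.0, 10.0), (2.0, 3.0)]
    print(temporal_iou(predictions[0], annotations[0]))
    print(recall_at_iou(predictions, annotations))
```

In an open-vocabulary setting, the same score would be computed on queries containing words or paraphrases absent from training, so that a drop relative to closed-vocabulary queries exposes the generalization gap the abstract describes.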

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition