Hierarchical Local-Global Transformer for Temporal Sentence Grounding
Xiang Fang, Daizong Liu, Pan Zhou, Zichuan Xu, Ruixuan Li

TL;DR
This paper introduces a Hierarchical Local-Global Transformer that models multi-level semantic interactions between video segments and query phrases, significantly improving temporal sentence grounding accuracy.
Contribution
The novel HLGT framework captures hierarchical semantic interactions and introduces a cross-modal cycle-consistency loss for enhanced multi-modal reasoning.
Findings
Achieves state-of-the-art results on three datasets.
Effectively models local and global semantic dependencies.
Improves grounding accuracy over existing methods.
Abstract
This paper studies the multimedia problem of temporal sentence grounding (TSG), which aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query. Traditional TSG methods mainly follow the top-down or bottom-up framework and are not end-to-end; they rely heavily on time-consuming post-processing to refine the grounding results. Recently, some transformer-based approaches have been proposed to efficiently and effectively model the fine-grained semantic alignment between video and query. Although these methods perform well to some extent, they treat video frames and query words equally as transformer input for correlation, failing to capture their different levels of granularity with distinct semantics. To address this issue, we propose a novel Hierarchical Local-Global Transformer (HLGT) to…
Taxonomy
Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Video Analysis and Summarization
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Layer Normalization · Softmax · Adam · Position-Wise Feed-Forward Layer · Dropout · Dense Connections
