Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding

Yuechen Wang; Wengang Zhou; Houqiang Li

arXiv:2210.11933·cs.CV·October 24, 2022

TL;DR

This paper introduces FSAN, a weakly supervised method for temporal language grounding that learns fine-grained token-by-clip semantic alignment, improving localization accuracy without requiring temporal boundary annotations.

Contribution

The paper proposes a candidate-free, token-level semantic alignment framework that captures temporal structure and complex semantics, advancing weakly supervised TLG methods.

Findings

01

Achieves state-of-the-art results on ActivityNet-Captions.

02

Outperforms existing weakly supervised methods on DiDeMo.

03

Demonstrates effectiveness of fine-grained semantic alignment.

Abstract

Temporal language grounding (TLG) aims to localize a video segment in an untrimmed video based on a natural language description. To alleviate the expensive cost of manual annotations for temporal boundary labels, we focus on the weakly supervised setting, where only video-level descriptions are provided for training. Most existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a MIL-based framework. However, the temporal structure of the video and the complicated semantics in the sentence are lost during learning. In this work, we propose a novel candidate-free framework, the Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG. Instead of viewing the sentence and candidate moments as a whole, FSAN learns token-by-clip cross-modal semantic alignment by an iterative cross-modal interaction…
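To make the idea of a token-by-clip alignment concrete, the following is a minimal sketch of a token-by-clip similarity map, not the FSAN architecture itself. It assumes token and clip features are already embedded in a shared space, uses plain cosine similarity in place of the paper's iterative cross-modal interaction, and averages over tokens to get a per-clip relevance score. The function names, feature sizes, and random inputs are all hypothetical.

```python
import numpy as np

def token_clip_alignment(token_feats, clip_feats):
    """Cosine similarity between every token and every clip.

    token_feats: array of shape (num_tokens, dim)
    clip_feats:  array of shape (num_clips, dim)
    Returns an array of shape (num_tokens, num_clips).
    Assumption: both feature sets live in the same embedding space.
    """
    t = token_feats / np.linalg.norm(token_feats, axis=1, keepdims=True)
    c = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    return t @ c.T

def sentence_clip_scores(alignment):
    """Collapse the token axis by averaging (an illustrative choice),
    giving one relevance score per clip."""
    return alignment.mean(axis=0)

# Hypothetical inputs: 6 sentence tokens, 20 video clips, 16-dim features.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 16))
clips = rng.normal(size=(20, 16))

align = token_clip_alignment(tokens, clips)   # shape (6, 20)
scores = sentence_clip_scores(align)          # shape (20,)
best_clip = int(np.argmax(scores))            # index of the most relevant clip
print(align.shape, scores.shape, best_clip)
```

Unlike a candidate-segment approach, this map keeps one score per (token, clip) pair, so the temporal order of clips and the contribution of each word stay visible until the final aggregation step.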
