Video-Guided Curriculum Learning for Spoken Video Grounding

Yan Xia; Zhou Zhao; Shangwei Ye; Yang Zhao; Haoyuan Li; Yi Ren

arXiv:2209.00277 · cs.CV · September 2, 2022

TL;DR

This paper introduces spoken video grounding, the task of localizing video segments from spoken descriptions, and proposes a video-guided curriculum learning approach that uses visual cues to help the model understand noisy speech.

Contribution

The paper proposes a video-guided curriculum learning method for spoken video grounding that makes the model more robust to noisy speech, and it creates the first large-scale dataset for this task.

Findings

01. Video-guided curriculum learning (VGCL) improves pre-training efficiency and grounding accuracy.

02. The model outperforms ASR-based methods in noisy conditions.

03. The paper introduces the ActivityNet Speech dataset for large-scale evaluation.

Abstract

In this paper, we introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions. Compared with using text, employing audio requires the model to directly exploit the useful phonemes and syllables related to the video from raw speech. Moreover, we randomly add environmental noises to this speech audio, further increasing the difficulty of this task and better simulating real applications. To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) during the audio pre-training process, which can make use of the vital visual perceptions to help understand the spoken language and suppress the external noise. Considering that during inference the model cannot obtain ground-truth video segments, we design a curriculum strategy…
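The abstract says environmental noise is randomly added to the speech audio but, as excerpted here, gives no procedure. One common way to do this is to scale a noise clip so the mixture hits a chosen signal-to-noise ratio (SNR). The sketch below illustrates that idea only. The function names `rms` and `mix_at_snr`, the 10 dB SNR, and the toy signals are all assumptions for illustration and are not taken from the paper or its repository.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB.

    Hypothetical helper; the paper's actual noise pipeline may differ.
    """
    if len(noise) < len(speech):
        raise ValueError("noise clip must be at least as long as the speech clip")
    # Pick a random window of the noise clip matching the speech length.
    start = random.randint(0, len(noise) - len(speech))
    window = noise[start:start + len(speech)]
    noise_rms = rms(window)
    if noise_rms == 0.0:
        return list(speech)  # silent noise window: nothing to add
    # SNR_dB = 20 * log10(rms_speech / rms_scaled_noise), solved for the noise gain.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, window)]


# Toy signals (assumed): a 440 Hz tone stands in for speech, uniform noise for the environment.
sample_rate = 16000
speech = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
noise = [random.uniform(-1.0, 1.0) for _ in range(2 * sample_rate)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

Lower `snr_db` values give a harder input, so sampling the SNR per utterance is one simple way to vary difficulty across training examples.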

Code & Models

Repositories

marmot-xy/spoken-video-grounding (PyTorch, official)

Taxonomy

Methods: Contrastive Language-Image Pre-training