TL;DR
This paper introduces a new task called spoken video grounding, utilizing a novel curriculum learning approach that leverages visual cues to improve understanding of noisy spoken language in videos.
Contribution
The paper proposes a video-guided curriculum learning method for spoken video grounding and creates the first large-scale dataset for this task, enhancing model robustness in noisy environments.
Findings
Video-guided curriculum learning (VGCL) improves pre-training efficiency and grounding accuracy.
The model outperforms ASR-based methods in noisy conditions.
Introduces the ActivityNet Speech dataset for large-scale evaluation.
Abstract
In this paper, we introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions. Compared with using text, employing audio requires the model to directly exploit the useful phonemes and syllables related to the video from raw speech. Moreover, we randomly add environmental noises to this speech audio, further increasing the difficulty of this task and better simulating real applications. To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) during the audio pre-training process, which can make use of the vital visual perceptions to help understand the spoken language and suppress the external noise. Considering that during inference the model cannot obtain ground truth video segments, we design a curriculum strategy…
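The abstract mentions randomly adding environmental noise to the speech audio but, as excerpted here, does not say how the mixing is done. The following is a minimal sketch of one common approach, scaling a noise clip to a target signal-to-noise ratio before adding it. The function name `mix_noise_at_snr`, the `snr_db` parameter, and the tile-and-crop handling of short noise clips are all assumptions made for illustration, not details taken from the paper. Plain Python lists are used so the example has no dependencies; an array library would be the usual choice for real audio.

```python
import math
import random


def mix_noise_at_snr(speech, noise, snr_db, rng=None):
    """Return speech plus noise, with the noise scaled to the requested SNR in dB.

    Hypothetical helper: the paper's actual noise-injection procedure
    is not described in the abstract.
    """
    rng = rng or random.Random()
    n = len(speech)
    if n == 0 or len(noise) == 0:
        return list(speech)

    # Repeat the noise clip if it is shorter than the speech,
    # then crop a random window of the same length as the speech.
    if len(noise) < n:
        noise = list(noise) * math.ceil(n / len(noise))
    start = rng.randint(0, len(noise) - n)
    window = noise[start:start + n]

    speech_power = sum(s * s for s in speech) / n
    noise_power = sum(x * x for x in window) / n
    if speech_power == 0 or noise_power == 0:
        return list(speech)

    # Choose scale so that 10 * log10(speech_power / (scale^2 * noise_power)) == snr_db.
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * x for s, x in zip(speech, window)]
```

A lower `snr_db` gives a noisier mixture, so sampling `snr_db` from a range per utterance would be one way to vary difficulty; whether the authors do this is not stated in the excerpt.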
Taxonomy
Methods: Contrastive Language-Image Pre-training
