SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
Ruixiang Zhao, Zhihao Xu, Bangxiang Lan, Zijie Xin, Jingyu Liu, Xirong Li

TL;DR
SAVE introduces a speech-aware video representation learning method that enhances video-text retrieval by effectively incorporating speech and audio features, outperforming previous audiovisual models across multiple benchmarks.
Contribution
The paper proposes a novel speech-aware video representation learning approach with a dedicated speech branch and early vision-audio alignment, improving upon state-of-the-art audiovisual methods.
Findings
SAVE outperforms AVIGATE by 4.1% on MSRVTT-9k
SAVE improves retrieval metrics on five benchmarks
Effective speech embedding enhances video-text retrieval performance
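The findings above refer to retrieval metrics without naming them. As an illustration only, Recall@K is a metric commonly reported for text-to-video retrieval; the sketch below computes it from a query-by-video similarity matrix. The function name `recall_at_k`, the toy scores, and the assumption that query i matches video i are all hypothetical and not taken from the paper.

```python
def recall_at_k(sim, k):
    """Fraction of text queries whose ground-truth video ranks in the top k.

    sim[i][j] is the similarity of text query i to video j.
    Assumption for this sketch: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of videos scoring strictly higher than the ground truth.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


# Toy similarity matrix (made-up numbers, 3 queries x 3 videos).
sim = [
    [0.9, 0.1, 0.3],
    [0.8, 0.7, 0.2],
    [0.1, 0.6, 0.5],
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
```

In this toy case only the first query ranks its own video first, while all three place it within the top two.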
Abstract
For video-text retrieval, the use of CLIP has been a de facto choice. Since CLIP provides only image and text encoders, this consensus has led to a biased paradigm that entirely ignores the sound track of videos. While several attempts have been made to reintroduce audio -- typically by incorporating an audio encoder and fusing its output with visual features -- these methods face two challenges: ineffective representation of speech content and suboptimal vision-audio fusion. To address these issues jointly, we propose SAVE, a Speech Aware Video rEpresentation learning method. SAVE improves upon AVIGATE, a SOTA audiovisual method, with a dedicated speech branch for more effective speech embedding. Furthermore, we introduce soft-ALBEF for early vision-audio alignment that facilitates fusion. Extensive experiments on five benchmarks show that SAVE compares favorably against the SOTA,…
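The abstract names soft-ALBEF for early vision-audio alignment but this summary does not give its formulation. As a rough sketch of the general idea of contrastive alignment with softened targets (in the spirit of ALBEF's momentum distillation, not the authors' method), the code below mixes one-hot targets with a teacher's similarity distribution. The function names, the `alpha` and `temp` values, the teacher scores, and the single vision-to-audio direction are all assumptions made for illustration.

```python
import math


def log_softmax(scores, temp):
    """Numerically stable log-softmax of scores / temp."""
    scaled = [s / temp for s in scores]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]


def soft_contrastive_loss(sim, teacher_sim, alpha=0.4, temp=0.07):
    """Cross-entropy between softened targets and the vision-to-audio distribution.

    sim[i][j]: similarity of vision clip i to audio clip j (pair i,i is positive).
    teacher_sim: same shape, from a hypothetical teacher model.
    alpha: weight of the teacher's soft distribution in the target (assumed value).
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        log_p = log_softmax(sim[i], temp)
        q = [math.exp(v) for v in log_softmax(teacher_sim[i], temp)]
        # Target: one-hot on the paired audio, softened by the teacher.
        target = [(1 - alpha) * (1.0 if j == i else 0.0) + alpha * q[j]
                  for j in range(n)]
        total += -sum(t * lp for t, lp in zip(target, log_p))
    return total / n


# Toy check: aligned pairs should give a lower loss than swapped pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_good = soft_contrastive_loss(aligned, teacher_sim=aligned)
loss_bad = soft_contrastive_loss(swapped, teacher_sim=aligned)
```

Setting `alpha=0` reduces this to a standard one-directional contrastive (InfoNCE-style) loss; a symmetric version would add the audio-to-vision term.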
Taxonomy
Topics: Speech and Audio Processing · Multimodal Machine Learning Applications · Music and Audio Processing
