SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval

Ruixiang Zhao; Zhihao Xu; Bangxiang Lan; Zijie Xin; Jingyu Liu; Xirong Li

arXiv:2603.08224·cs.CV·March 12, 2026

Open Access

TL;DR

SAVE introduces a speech-aware video representation learning method that enhances video-text retrieval by effectively incorporating speech and audio features, outperforming previous audiovisual models across multiple benchmarks.

Contribution

The paper proposes SAVE, a speech-aware video representation learning method that extends AVIGATE, a state-of-the-art audiovisual method, with a dedicated speech branch, and introduces soft-ALBEF for early vision-audio alignment ahead of fusion.

Findings

01. SAVE outperforms AVIGATE by +4.1% on MSRVTT-9k
02. SAVE improves retrieval metrics on five benchmarks
03. Effective speech embedding enhances video-text retrieval performance
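The findings above refer to retrieval metrics without naming them. As a rough illustration only, the sketch below computes Recall@K, a metric commonly reported for video-text retrieval; that SAVE is scored this way is an assumption, and the function name `recall_at_k` and the toy similarity matrix are hypothetical.

```python
from typing import Sequence


def recall_at_k(sim: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of text queries whose matching video ranks in the top k.

    Assumption: sim[i][j] is the similarity between text query i and
    video j, and video i is the single ground-truth match for query i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the true match = number of videos scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


# Toy 3x3 similarity matrix (made-up numbers, not results from the paper).
sim = [
    [0.9, 0.1, 0.3],  # query 0: true video ranked first
    [0.2, 0.4, 0.8],  # query 1: true video ranked second
    [0.5, 0.6, 0.7],  # query 2: true video ranked first
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
```

In this toy case two of the three queries retrieve their video at rank 1, and all three do so within the top 2.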

Abstract

For video-text retrieval, the use of CLIP has been a de facto choice. Since CLIP provides only image and text encoders, this consensus has led to a biased paradigm that entirely ignores the sound track of videos. While several attempts have been made to reintroduce audio -- typically by incorporating an audio encoder and fusing its output with visual features -- these methods face two challenges: ineffective representation of speech content and suboptimal vision-audio fusion. To address these issues jointly, we propose SAVE, a Speech Aware Video rEpresentation learning method. SAVE improves upon AVIGATE, a SOTA audiovisual method, with a dedicated speech branch for more effective speech embedding. Furthermore, we introduce soft-ALBEF for early vision-audio alignment that facilitates fusion. Extensive experiments on five benchmarks show that SAVE compares favorably against the SOTA,…
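The abstract mentions early vision-audio alignment but does not define soft-ALBEF in this excerpt. The sketch below therefore shows only a generic symmetric contrastive (InfoNCE-style) alignment loss between paired vision and audio embeddings, which is a stand-in technique and not the paper's formulation. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions.

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def alignment_loss(vision: List[List[float]],
                   audio: List[List[float]],
                   temperature: float = 0.07) -> float:
    """Symmetric contrastive loss; vision[i] and audio[i] form a pair.

    Generic InfoNCE-style objective, NOT the paper's soft-ALBEF.
    """
    v = [l2_normalize(x) for x in vision]
    a = [l2_normalize(x) for x in audio]
    # Cosine similarity of every vision/audio pair, scaled by temperature.
    logits = [[sum(p * q for p, q in zip(vi, aj)) / temperature for aj in a]
              for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # audio-to-vision view
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy embeddings (made up): correctly paired vs. deliberately mismatched.
aligned = alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each vision embedding is most similar to its own audio embedding and grows when the pairs are mismatched, which is the property an alignment step of this kind relies on before fusion.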

Taxonomy

Topics: Speech and Audio Processing · Multimodal Machine Learning Applications · Music and Audio Processing