Leveraging Pretrained Image-text Models for Improving Audio-Visual Learning
Saurabhchand Bhati, Jesús Villalba, Laureano Moro-Velazquez, Thomas Thebaud, Najim Dehak

TL;DR
This paper introduces Segmental SpeechCLIP, a hierarchical speech encoder that improves audio-visual learning by leveraging pretrained image and text models like CLIP, DINO, and RoBERTa, achieving significant performance gains.
Contribution
It proposes a novel hierarchical speech encoder that directly learns word-like units and effectively utilizes pretrained multimodal and unimodal models for improved audio-visual learning.
Findings
Significant improvements over cascaded SpeechCLIP.
Audio-only systems perform close to audio-visual systems.
Mapping audio to CLIP vocabulary embeddings enhances semantic understanding.
Abstract
Visually grounded speech systems learn from paired images and their spoken captions. Recently, there have been attempts to use visually grounded models trained on images and their corresponding text captions, such as CLIP, to improve the performance of speech-based visually grounded models. However, the majority of these models only use the pretrained image encoder. Cascaded SpeechCLIP attempted to generate localized word-level information and use both the pretrained image and text encoders. Despite using both, its authors observed a substantial drop in retrieval performance. We propose Segmental SpeechCLIP, which uses a hierarchical segmental speech encoder to generate sequences of word-like units. We apply the pretrained CLIP text encoder on top of these word-like unit representations and show significant improvements over the cascaded variant of SpeechCLIP. Segmental SpeechCLIP…
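The idea of turning frame-level speech features into word-like units and relating them to a text vocabulary can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the fixed segment boundaries, the mean pooling, and the two-dimensional "vocabulary embeddings" are all assumptions made for illustration. The paper's encoder learns its segments, and the real vocabulary embeddings would come from the CLIP text encoder.

```python
import math

def mean_pool_segments(frames, boundaries):
    """Average the frame vectors inside each [start, end) segment (assumed pooling)."""
    segments = []
    for start, end in boundaries:
        chunk = frames[start:end]
        dim = len(chunk[0])
        segments.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return segments

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_vocab(segment, vocab):
    """Return the vocabulary entry whose embedding is most similar to the segment."""
    return max(vocab, key=lambda word: cosine(segment, vocab[word]))

# Hypothetical toy data: four speech frames, two hand-picked segments,
# and a two-word vocabulary with made-up embeddings.
frames = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
boundaries = [(0, 2), (2, 4)]
vocab = {"dog": [1.0, 0.0], "grass": [0.0, 1.0]}

segments = mean_pool_segments(frames, boundaries)
words = [nearest_vocab(s, vocab) for s in segments]
print(words)
```

In this sketch each pooled segment stands in for one word-like unit. A sequence of such units is what a pretrained text encoder would then consume in place of token embeddings.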
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Music and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Attention Dropout · Adam · Weight Decay · Linear Warmup With Linear Decay · WordPiece · Dropout · BERT
