Leveraging Pretrained Image-text Models for Improving Audio-Visual Learning

Saurabhchand Bhati; Jesús Villalba; Laureano Moro-Velazquez; Thomas Thebaud; Najim Dehak

arXiv:2309.04628 · eess.AS · September 12, 2023

TL;DR

This paper introduces Segmental SpeechCLIP, a hierarchical speech encoder that improves audio-visual learning by leveraging pretrained image and text models such as CLIP, DINO, and RoBERTa, and reports significant improvements over Cascaded SpeechCLIP.

Contribution

It proposes a novel hierarchical speech encoder that directly learns word-like units and effectively utilizes pretrained multimodal and unimodal models for improved audio-visual learning.

Findings

01

Significant improvements over cascaded SpeechCLIP.

02

Audio-only systems perform close to audio-visual systems.

03

Mapping audio to CLIP vocabulary embeddings enhances semantic understanding.

Abstract

Visually grounded speech systems learn from paired images and their spoken captions. Recently, there have been attempts to utilize the visually grounded models trained from images and their corresponding text captions, such as CLIP, to improve speech-based visually grounded models' performance. However, the majority of these models only utilize the pretrained image encoder. Cascaded SpeechCLIP attempted to generate localized word-level information and utilize both the pretrained image and text encoders. Despite using both, they noticed a substantial drop in retrieval performance. We proposed Segmental SpeechCLIP which used a hierarchical segmental speech encoder to generate sequences of word-like units. We used the pretrained CLIP text encoder on top of these word-like unit representations and showed significant improvements over the cascaded variant of SpeechCLIP. Segmental SpeechCLIP…
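The idea of turning frame-level speech features into word-like units and relating them to text-side embeddings can be illustrated with a toy sketch. This is not the authors' method: the paper's segmental encoder is learned, whereas the fixed segment boundaries, the mean pooling, the two-dimensional vectors, and the tiny `vocab` dictionary below are all assumptions made purely for illustration, and every function name is hypothetical.

```python
import math

def mean_pool_segments(frames, boundaries):
    """Average the frame vectors inside each [start, end) segment.

    Assumption: boundaries are given; the paper learns its segmentation.
    """
    pooled = []
    for start, end in boundaries:
        segment = frames[start:end]
        dim = len(segment[0])
        pooled.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return pooled

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_vocab(units, vocab):
    """For each word-like unit, return the key of the most similar vocabulary vector."""
    return [max(vocab, key=lambda w: cosine(unit, vocab[w])) for unit in units]

# Toy data (hypothetical): five 2-D "frames" split into two segments.
frames = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 1.0]]
boundaries = [(0, 2), (2, 5)]
# Stand-in for text-side vocabulary embeddings; not real CLIP vectors.
vocab = {"dog": [1.0, 0.1], "ball": [0.0, 1.0]}

units = mean_pool_segments(frames, boundaries)
words = nearest_vocab(units, vocab)
print(words)
```

In a real system the pooled units would be fed to a pretrained text encoder and trained with a retrieval objective against image embeddings; the nearest-neighbour lookup here only shows how a segment-level vector can be compared against a fixed vocabulary.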

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Attention Dropout · Adam · Weight Decay · Linear Warmup With Linear Decay · WordPiece · Dropout · BERT