SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

Yi-Jen Shih; Hsuan-Fu Wang; Heng-Jui Chang; Layne Berry; Hung-yi Lee; David Harwath

arXiv:2210.00705·cs.CL·October 26, 2022·1 cites

TL;DR

SpeechCLIP introduces a framework that connects speech, images, and text using pre-trained models, enabling improved speech understanding and retrieval without relying on transcribed speech data.

Contribution

SpeechCLIP is the first framework to integrate speech with pre-trained vision and language models by using images as the bridge, reducing the need for costly transcribed speech data.

Findings

01 Outperforms previous methods on image-speech retrieval tasks

02 Achieves zero-shot speech-text retrieval without transcriptions

03 Can directly retrieve semantically related keywords from speech

Abstract

Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance speech models without transcriptions. We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning. SpeechCLIP outperforms prior state-of-the-art on image-speech retrieval and performs zero-shot speech-text retrieval without direct supervision from transcriptions. Moreover, SpeechCLIP can directly retrieve semantically related keywords from speech.
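The abstract describes aligning a speech encoder with CLIP through paired images and spoken captions, then using the shared space for retrieval. A minimal sketch of that idea follows, assuming a CLIP-style symmetric contrastive objective over a batch of paired embeddings. The function names, the temperature value, and the toy vectors are illustrative assumptions, not taken from the paper or its repository; real inputs would be pooled HuBERT-derived speech embeddings and CLIP image embeddings.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(speech_embs, image_embs):
    # Entry [i][j] is the cosine similarity between speech i and image j.
    s = [l2_normalize(v) for v in speech_embs]
    im = [l2_normalize(v) for v in image_embs]
    return [[sum(a * b for a, b in zip(sv, iv)) for iv in im] for sv in s]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row k is column k,
    # i.e. the k-th speech clip is paired with the k-th image.
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)

def symmetric_contrastive_loss(speech_embs, image_embs, temperature=0.07):
    # temperature=0.07 is an assumed illustrative value, not from the paper.
    sim = similarity_matrix(speech_embs, image_embs)
    logits = [[x / temperature for x in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # image-to-speech direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

def retrieve(query_emb, candidate_embs):
    # Return the index of the candidate most similar to the query.
    sims = similarity_matrix([query_emb], candidate_embs)[0]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy 2-D embeddings standing in for real encoder outputs (hypothetical data).
speech = [[1.0, 0.0], [0.0, 1.0]]
images = [[1.0, 0.1], [0.1, 1.0]]

aligned_loss = symmetric_contrastive_loss(speech, images)
shuffled_loss = symmetric_contrastive_loss(speech, images[::-1])
best_image = retrieve(speech[0], images)
```

In this sketch, correctly paired batches give a lower loss than shuffled ones, and retrieval reduces to a nearest-neighbor lookup by cosine similarity. Under the same assumption, zero-shot speech-text retrieval would swap the image candidates for CLIP text embeddings, since both live in the space the speech encoder was aligned to.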

Code & Models

Repositories

atosystem/speechclip (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Contrastive Language-Image Pre-training