BrewCLIP: A Bifurcated Representation Learning Framework for Audio-Visual Retrieval

Zhenyu Lu; Lakshay Sethi

arXiv:2408.10383·cs.SD·August 21, 2024

TL;DR

BrewCLIP introduces a bifurcated learning framework that captures both textual and non-textual speech information to enhance audio-visual retrieval, outperforming existing models.

Contribution

The paper proposes a novel dual-channel model that leverages both textual and non-textual speech features for improved audio-image matching.

Findings

01 Significant performance improvement over state-of-the-art methods

02 Effective utilization of non-textual speech information such as accent and mood

03 Robust retrieval results across multiple datasets

Abstract

Previous methods for audio-image matching generally fall into one of two categories: pipeline models or end-to-end models. Pipeline models first transcribe speech and then encode the resulting text; end-to-end models encode speech directly. Generally, pipeline models outperform end-to-end models, but the intermediate transcription necessarily discards some potentially useful non-textual information. In addition to textual information, speech can convey details such as accent, mood, and emphasis, which should be effectively captured in the encoded representation. In this paper, we investigate whether non-textual information, which is overlooked by pipeline-based models, can be leveraged to improve speech-image matching performance. We thoroughly analyze and compare end-to-end models, pipeline models, and our proposed dual-channel model for robust audio-image retrieval on a variety of…
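
To make the distinction between a text-only (pipeline) query and a dual-channel query concrete, here is a minimal sketch of retrieval by cosine similarity. It is not the BrewCLIP architecture: the function names `l2_normalize`, `fuse`, and `rank_images` are hypothetical, the embeddings are hand-written toy vectors rather than encoder outputs, and fusing the two channels by element-wise sum is an assumption made purely for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(text_emb, audio_emb):
    """Combine the transcription-channel and raw-audio-channel embeddings.

    Assumption: element-wise sum followed by normalization. The paper's
    actual fusion mechanism is not described in this listing.
    """
    return l2_normalize([t + a for t, a in zip(text_emb, audio_emb)])


def rank_images(query_emb, image_embs):
    """Return image indices sorted by cosine similarity to the query, best first."""
    q = l2_normalize(query_emb)
    scores = [
        sum(x * y for x, y in zip(q, l2_normalize(img)))
        for img in image_embs
    ]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)


# Toy data (hypothetical): dimension 0 stands for textual content,
# dimension 1 for non-textual cues such as mood or emphasis.
text_emb = [1.0, 0.0, 0.0]    # what a pipeline model would keep
audio_emb = [0.0, 1.0, 0.0]   # what the transcription step discards
image_embs = [
    [1.0, 0.0, 0.0],  # matches the words only
    [1.0, 1.0, 0.0],  # matches the words and the non-textual cue
    [0.0, 0.0, 1.0],  # unrelated
]

pipeline_ranking = rank_images(text_emb, image_embs)
dual_ranking = rank_images(fuse(text_emb, audio_emb), image_embs)
```

On this toy data the text-only query ranks image 0 first, while the fused query ranks image 1 first, which illustrates how a signal dropped by transcription could change the retrieval result.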

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing