BrewCLIP: A Bifurcated Representation Learning Framework for Audio-Visual Retrieval
Zhenyu Lu, Lakshay Sethi

TL;DR
BrewCLIP introduces a bifurcated learning framework that captures both textual and non-textual speech information to enhance audio-visual retrieval, outperforming existing models.
Contribution
The paper proposes a novel dual-channel model that leverages both textual and non-textual speech features for improved audio-image matching.
Findings
Significant performance improvement over state-of-the-art methods
Effective utilization of non-textual speech information such as accent and mood
Robust retrieval results across multiple datasets
Abstract
Previous methods for audio-image matching generally fall into one of two categories: pipeline models or end-to-end models. Pipeline models first transcribe speech and then encode the resulting text; end-to-end models encode speech directly. Generally, pipeline models outperform end-to-end models, but the intermediate transcription necessarily discards some potentially useful non-textual information. In addition to textual information, speech can convey details such as accent, mood, and emphasis, which should be effectively captured in the encoded representation. In this paper, we investigate whether non-textual information, which is overlooked by pipeline-based models, can be leveraged to improve speech-image matching performance. We thoroughly analyze and compare end-to-end models, pipeline models, and our proposed dual-channel model for robust audio-image retrieval on a variety of…
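The structural difference between the three model families described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the hash-based `_toy_embed` encoder, the dictionary representation of an audio clip, and the additive fusion in `dual_channel_embed` are all assumptions made for illustration, since the summary does not specify how the two channels are combined.

```python
import hashlib
import math

DIM = 8  # toy embedding size, chosen only for illustration


def _unit(vec):
    """Scale a vector to unit length (guarding against a zero norm)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def _toy_embed(data):
    """Hypothetical stand-in for a learned encoder: hash bytes to a unit vector."""
    digest = hashlib.sha256(data).digest()
    return _unit([b / 255.0 - 0.5 for b in digest[:DIM]])


def transcribe(audio):
    """Hypothetical ASR step: keeps the words and drops everything else."""
    return audio["words"]


def pipeline_embed(audio):
    """Pipeline model: transcribe first, then encode only the text."""
    return _toy_embed(transcribe(audio).encode())


def end_to_end_embed(audio):
    """End-to-end model: encode the raw signal, words and prosody together."""
    return _toy_embed((audio["words"] + "|" + audio["prosody"]).encode())


def dual_channel_embed(audio):
    """Dual-channel model: combine a textual and a non-textual channel.

    Summing the two embeddings is an assumption of this sketch, not a
    detail taken from the paper summary.
    """
    text_vec = pipeline_embed(audio)
    audio_vec = end_to_end_embed(audio)
    return _unit([t + a for t, a in zip(text_vec, audio_vec)])


# Two clips with identical words but different delivery.
calm = {"words": "a dog runs on the beach", "prosody": "calm"}
excited = {"words": "a dog runs on the beach", "prosody": "excited"}

same_for_pipeline = pipeline_embed(calm) == pipeline_embed(excited)
same_for_dual = dual_channel_embed(calm) == dual_channel_embed(excited)
```

In this toy setup the pipeline embedding cannot distinguish the two clips because the transcription step removes the prosody, while the dual-channel embedding keeps the textual channel and still reflects the difference in delivery.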
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
