Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer
Yanpeng Zhao, Jack Hessel, Youngjae Yu, Ximing Lu, Rowan Zellers, Yejin Choi

TL;DR
This paper introduces VIP-ANT, a method that aligns audio and text representations without parallel audio-text data by using images as a bridge. It reports state-of-the-art zero-shot audio classification on ESC50 and US8K and surpasses the supervised state of the art on Clotho caption retrieval.
Contribution
VIP-ANT is the first approach to connect audio and text without parallel audio-text data, using the image modality as a pivot in a tri-modal embedding space.
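The pivot idea can be illustrated with a toy sketch. This is not the authors' implementation: the embeddings below are synthetic NumPy vectors, and the names `l2_normalize`, `zero_shot_classify`, and the three class labels are hypothetical, chosen only for illustration. The sketch assumes that text and audio embeddings have each been aligned to the same image embeddings (simulated here by adding small noise to shared pivot vectors), and shows that audio can then be matched to text by cosine similarity even though no audio-text pair was ever used.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_classify(audio_emb, text_emb):
    """Return, for each audio clip, the index of the most similar label text."""
    sims = l2_normalize(audio_emb) @ l2_normalize(text_emb).T
    return sims.argmax(axis=-1)


rng = np.random.default_rng(0)
labels = ["dog barking", "rain", "siren"]  # hypothetical class names

# One synthetic "image" pivot vector per concept (assumption: 8-dim space).
image_emb = np.eye(len(labels), 8)

# Simulate image-text alignment and image-audio alignment separately:
# each modality lands near the shared image pivot, up to small noise.
text_emb = image_emb + 0.05 * rng.standard_normal(image_emb.shape)
audio_emb = image_emb + 0.05 * rng.standard_normal(image_emb.shape)

# Audio and text were never paired directly, yet they are comparable
# because both were pulled toward the same image vectors.
pred = zero_shot_classify(audio_emb, text_emb)
predicted_labels = [labels[i] for i in pred]
print(predicted_labels)
```

In a real system the synthetic vectors would be replaced by the outputs of trained encoders; the classification step itself would remain a nearest-neighbor lookup over label-text embeddings.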
Findings
Achieves state-of-the-art zero-shot performance on ESC50 and US8K datasets.
Surpasses supervised methods in Clotho caption retrieval with audio queries.
Minimal supervised data significantly boosts zero-shot accuracy.
Abstract
Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning systems. Prevailing learning paradigms have relied on parallel audio-text data, which is, however, scarcely available on the web. We propose VIP-ANT, which induces Audio-Text alignment without using any parallel audio-text data. Our key idea is to share the image modality between bi-modal image-text representations and bi-modal image-audio representations; the image modality functions as a pivot and implicitly connects audio and text in a tri-modal embedding space. In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio…
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies · Speech and Audio Processing
