Unsupervised Audio-Caption Aligning Learns Correspondences between Individual Sound Events and Textual Phrases

Huang Xie; Okko Räsänen; Konstantinos Drossos; Tuomas Virtanen

arXiv:2110.02939·eess.AS·February 22, 2022·ICASSP

TL;DR

This paper presents an unsupervised method that aligns audio clips with their textual captions, learning correspondences between individual sound events and textual phrases from unaligned, unannotated clip-caption pairs. The method is evaluated on audio-caption retrieval and phrase-based sound event detection.

Contribution

The authors introduce an unsupervised approach that aligns audio and text by scoring similarities between audio frames and words, learning both global (clip-caption) and local (event-phrase) correspondences.

Findings

01 Effective cross-modal retrieval performance

02 Successful local sound event-phrase correspondence learning

03 Competitive results in unsupervised sound event detection

Abstract

We investigate unsupervised learning of correspondences between sound events and textual phrases through aligning audio clips with textual captions describing the content of a whole audio clip. We align originally unaligned and unannotated audio clips and their captions by scoring the similarities between audio frames and words, as encoded by modality-specific encoders and using a ranking-loss criterion to optimize the model. After training, we obtain clip-caption similarity by averaging frame-word similarities and estimate event-phrase correspondences by calculating frame-phrase similarities. We evaluate the method with two cross-modal tasks: audio-caption retrieval, and phrase-based sound event detection (SED). Experimental results show that the proposed method can globally associate audio clips with captions as well as locally learn correspondences between individual sound events and…
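
The scoring steps described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes dot-product similarity between frame and word embeddings, a hinge-style ranking loss with one mismatched caption and one mismatched clip per pair, and a frame-phrase score computed as the mean over the phrase's words. All function names, the margin value, and the toy embeddings are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length embeddings (assumed similarity measure)."""
    return sum(x * y for x, y in zip(a, b))


def frame_word_similarities(frames: List[Vector], words: List[Vector]) -> List[List[float]]:
    """Similarity matrix with one row per audio frame and one column per word."""
    return [[dot(f, w) for w in words] for f in frames]


def clip_caption_similarity(frames: List[Vector], words: List[Vector]) -> float:
    """Global score: average of all frame-word similarities."""
    sims = frame_word_similarities(frames, words)
    return sum(sum(row) for row in sims) / (len(frames) * len(words))


def frame_phrase_similarities(frames: List[Vector], phrase_words: List[Vector]) -> List[float]:
    """Local scores: one value per frame, averaged over the words of a phrase.

    Assumption: a phrase is represented by its word embeddings.
    """
    sims = frame_word_similarities(frames, phrase_words)
    return [sum(row) / len(row) for row in sims]


def ranking_loss(pos: float, neg_caption: float, neg_clip: float, margin: float = 1.0) -> float:
    """Hinge-style ranking loss (assumed form): the matching pair should score
    at least `margin` higher than either mismatched pair."""
    return max(0.0, margin + neg_caption - pos) + max(0.0, margin + neg_clip - pos)


# Toy example: two audio frames, and a two-word phrase that matches only frame 0.
frames = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [1.0, 0.0]]

clip_score = clip_caption_similarity(frames, words)      # 0.5
frame_scores = frame_phrase_similarities(frames, words)  # [1.0, 0.0]
loss = ranking_loss(pos=clip_score, neg_caption=0.2, neg_clip=0.9, margin=0.5)
```

In a phrase-based SED setting, per-frame scores such as `frame_scores` could then be thresholded to decide in which frames the phrase's sound event is active. The threshold choice is not specified here and would be a further assumption.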

Code & Models

Repositories

xieh97/dcase2022-audio-retrieval
pytorch

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis