Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags

Xavier Favory; Konstantinos Drossos; Tuomas Virtanen; Xavier Serra

arXiv:2010.14171 · cs.SD · October 28, 2020

TL;DR

This paper introduces a method that combines an audio autoencoder, a general word embeddings model, and multi-head self-attention to learn contextual tag embeddings and align them with audio embeddings, improving performance on downstream audio classification tasks.

Contribution

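The paper proposes applying multi-head self-attention to the word embeddings of the tags associated with an audio clip. This yields contextual tag representations that enhance audio representation learning and lets the text branch generalize to tags not seen during training.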
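
As a rough illustration of contextualizing tags with self-attention, the sketch below applies single-head scaled dot-product attention directly to tag word vectors. It is a simplification and not the authors' implementation: the learned query/key/value projections and the multiple heads of a real multi-head layer are omitted, and the three-dimensional vectors in `tags` are made-up stand-ins for the output of a pre-trained word embeddings model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tag_vectors):
    """Single-head scaled dot-product self-attention.

    Assumption: identity query/key/value projections, so each output is a
    weighted average of the input tag vectors.
    """
    dim = len(tag_vectors[0])
    contextual = []
    for query in tag_vectors:
        # Similarity of this tag to every tag in the same clip.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tag_vectors
        ]
        weights = softmax(scores)
        # Mix the tag vectors according to the attention weights.
        contextual.append(
            [sum(w * v[i] for w, v in zip(weights, tag_vectors)) for i in range(dim)]
        )
    return contextual


# Hypothetical word vectors for the tags of one audio clip.
tags = {
    "dog": [1.0, 0.0, 0.0],
    "bark": [0.8, 0.6, 0.0],
    "park": [0.0, 0.0, 1.0],
}
contextual_tags = self_attention(list(tags.values()))
```

Because the attention operates on word vectors rather than on a fixed tag vocabulary, any tag that has a vector in the word embeddings model can be passed through it, which is one way to read the claim about unseen tags.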

Findings

01 Multi-head self-attention over the tags improves the learned audio representations.

02 The method outperforms baseline models in classification tasks.

03 Joint optimization of the audio autoencoder (AAE) and the multi-head self-attention (MHA) mechanism enhances embedding quality.

Abstract

Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings that can be employed in various downstream tasks. Published approaches that consider both audio and words/tags associated with audio do not employ text processing models that are capable of generalizing to tags unknown during training. In this work we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM), and a multi-head self-attention (MHA) mechanism. MHA attends to the output of the WEM, providing a contextualized representation of the tags associated with the audio, and we align the output of MHA with the output of the encoder of AAE using a contrastive loss. We jointly optimize AAE and MHA, and we evaluate the audio representations (i.e., the output of the encoder of AAE) by utilizing them in…
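
To make the alignment step more concrete, here is a minimal sketch of a contrastive objective between audio embeddings and tag embeddings. The abstract says only that "a contrastive loss" is used, so the specific form below (an InfoNCE-style loss over cosine similarities), the `temperature` value, and the toy two-dimensional embeddings are all assumptions for illustration and may differ from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(audio_embs, tag_embs, temperature=0.1):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    Row i of audio_embs and row i of tag_embs belong to the same clip and
    form the positive pair; all other rows in the batch act as negatives.
    """
    n = len(audio_embs)
    total = 0.0
    for i in range(n):
        logits = [
            cosine(audio_embs[i], tag_embs[j]) / temperature for j in range(n)
        ]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings standing in for the AAE encoder output and the MHA output.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned_tags = [[0.9, 0.1], [0.1, 0.9]]
swapped_tags = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = contrastive_loss(audio, aligned_tags)
loss_swapped = contrastive_loss(audio, swapped_tags)
```

In this toy batch the loss is small when each audio embedding sits closest to its own tag embedding and large when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to reward.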

Code & Models

Repositories

xavierfav/ae-w2v-attention (PyTorch, official)
