TL;DR
This paper introduces a method that combines an audio autoencoder, a word embeddings model, and multi-head self-attention to learn contextual tag embeddings aligned with audio, improving performance on downstream audio classification tasks.
Contribution
It proposes a new approach that uses multi-head self-attention on tags to enhance audio representation learning and generalize to unseen tags.
Findings
Multi-head self-attention improves audio representations.
The method outperforms baseline models in classification tasks.
Joint optimization of the audio autoencoder (AAE) and the multi-head self-attention (MHA) mechanism enhances embedding quality.
Abstract
Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings that can be employed in various downstream tasks. Published approaches that consider both audio and words/tags associated with audio do not employ text processing models capable of generalizing to tags unknown during training. In this work we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM), and a multi-head self-attention (MHA) mechanism. MHA attends to the output of the WEM, providing a contextualized representation of the tags associated with the audio, and we align the output of MHA with the output of the encoder of AAE using a contrastive loss. We jointly optimize AAE and MHA, and we evaluate the audio representations (i.e., the output of the encoder of AAE) by utilizing them in…
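The alignment step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the mean-pooling over tags, the symmetric InfoNCE-style form of the contrastive loss, the temperature value, and all function and variable names are assumptions made for illustration, and random vectors stand in for the WEM output and the AAE encoder output. The autoencoder and its reconstruction term are omitted.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over one clip's tag embeddings.

    x: (num_tags, d_model) word embeddings of the tags (stand-in for WEM output).
    Returns contextualized tag embeddings of shape (num_tags, d_model).
    """
    n, d = x.shape
    d_head = d // num_heads

    def split(t):
        # (n, d) -> (num_heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, n, n)
    attn = np.exp(log_softmax(scores, axis=-1))           # rows sum to 1
    heads = attn @ v                                      # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d)       # (n, d)
    return concat @ w_o

def contrastive_loss(audio_emb, tag_emb, temperature=0.1):
    """Assumed symmetric InfoNCE-style loss; inputs are (batch, d).

    Row i of audio_emb and row i of tag_emb form the positive pair;
    all other rows in the batch act as negatives.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = tag_emb / np.linalg.norm(tag_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature                        # (batch, batch)
    idx = np.arange(len(a))
    audio_to_tag = -log_softmax(logits, axis=1)[idx, idx].mean()
    tag_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (audio_to_tag + tag_to_audio) / 2

# Hypothetical sizes: 4 clips, 3 tags per clip, 8-dimensional embeddings, 2 heads.
rng = np.random.default_rng(0)
batch, num_tags, d_model, num_heads = 4, 3, 8, 2
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

tag_word_emb = rng.normal(size=(batch, num_tags, d_model))  # stand-in for WEM output
audio_emb = rng.normal(size=(batch, d_model))               # stand-in for AAE encoder output

# Contextualize each clip's tags, then mean-pool to one vector per clip (assumption).
tag_emb = np.stack([
    multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads).mean(axis=0)
    for x in tag_word_emb
])
loss = contrastive_loss(audio_emb, tag_emb)
```

In a trainable version, the projection matrices and the encoder would be updated by gradient descent on this loss together with the autoencoder's reconstruction loss; the sketch only evaluates the forward pass.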
