EnCodecMAE: Leveraging neural codecs for universal audio representation learning

Leonardo Pepino; Pablo Riera; Luciana Ferrer

arXiv:2309.07391 · cs.SD · May 22, 2024 · 2 cites

TL;DR

EnCodecMAE is a masked autoencoder that masks representations of the audio signal and predicts, from the unmasked inputs, the discrete units that the EnCodec neural codec generates for the masked segments. The resulting representation is intended for speech, music, and environmental sounds alike.

Contribution

This work uses the discrete units of a neural audio codec as prediction targets for a masked autoencoder, applied to universal audio representation learning. The best model outperforms various state-of-the-art audio representation models in terms of global performance.

Findings

01 Outperforms various state-of-the-art audio representation models in global performance

02 Achieves competitive results in automatic speech recognition

03 Demonstrates versatility across speech, music, and environmental sounds

Abstract

The goal of universal audio representation learning is to obtain foundational models that can be used for a variety of downstream tasks involving speech, music and environmental sounds. To approach this problem, methods inspired by works on self-supervised learning for NLP, like BERT, or computer vision, like masked autoencoders (MAE), are often adapted to the audio domain. In this work, we propose masking representations of the audio signal, and training a MAE to reconstruct the masked segments. The reconstruction is done by predicting the discrete units generated by EnCodec, a neural audio codec, from the unmasked inputs. We evaluate this approach, which we call EnCodecMAE, on a wide range of tasks involving speech, music and environmental sounds. Our best model outperforms various state-of-the-art audio representation models in terms of global performance. Additionally, we evaluate…
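The training objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single codebook of discrete units, a simple span-masking rule, and a cross-entropy loss averaged over masked frames only; the abstract does not specify any of these details. The names `make_span_mask` and `masked_cross_entropy`, and the parameters `mask_prob` and `span_len`, are hypothetical.

```python
import math
import random


def make_span_mask(num_frames, mask_prob, span_len, rng):
    """Return a list of booleans; True marks a frame hidden from the model.

    Assumption: each frame starts a masked span with probability mask_prob.
    """
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < mask_prob:
            for t in range(start, min(start + span_len, num_frames)):
                mask[t] = True
    return mask


def masked_cross_entropy(logits, targets, mask):
    """Average cross-entropy over masked frames only (an assumption).

    logits:  per-frame lists of unnormalised scores over the codebook
    targets: per-frame integer ids of the codec units to predict
    mask:    per-frame booleans, True where the input was masked
    """
    total, count = 0.0, 0
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        # Numerically stable log-sum-exp for the softmax normaliser.
        peak = max(frame_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in frame_logits))
        total += log_norm - frame_logits[target]
        count += 1
    return total / count if count else 0.0


# Toy usage: random stand-ins for codec units and an untrained model.
rng = random.Random(0)
num_frames, codebook_size = 50, 8
targets = [rng.randrange(codebook_size) for _ in range(num_frames)]
mask = make_span_mask(num_frames, mask_prob=0.1, span_len=5, rng=rng)
uniform_logits = [[0.0] * codebook_size for _ in range(num_frames)]
loss = masked_cross_entropy(uniform_logits, targets, mask)
```

If at least one frame is masked, uniform logits give a loss equal to the logarithm of the codebook size, which is a convenient sanity check for an untrained predictor. In a real system the targets would come from a neural codec and the logits from a trained encoder-decoder, neither of which is modelled here.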

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Masked autoencoder · Multi-Head Attention · Attention Is All You Need · Linear Layer · Attention Dropout · Residual Connection · Adam · Weight Decay · Softmax