BEATs: Audio Pre-Training with Acoustic Tokenizers

Sanyuan Chen; Yu Wu; Chengyi Wang; Shujie Liu; Daniel Tompkins; Zhuo Chen; Furu Wei

arXiv:2212.09058 · eess.AS · December 20, 2022 · 40 cites

TL;DR

BEATs is an iterative audio pre-training framework that pairs an acoustic tokenizer with an audio SSL model, so that the model learns by predicting semantic-rich discrete labels rather than by reconstructing its input. It reports state-of-the-art results without external data.

Contribution

The paper proposes a novel iterative training method combining acoustic tokenizers with SSL models, enhancing semantic audio representation and performance.

Findings

01 Achieved state-of-the-art mAP 50.6% on AudioSet-2M

02 Reached 98.1% accuracy on ESC-50

03 Generated rich semantic discrete labels for audio
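
For readers unfamiliar with the mAP figure quoted above, the following is a minimal sketch of how mean average precision is commonly computed for multi-label audio tagging. It is an illustration only, not the evaluation code used in the paper; the function names and the toy scores are hypothetical, and real evaluation scripts may handle ties or interpolation differently.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class.

    scores: predicted confidence per clip; labels: 1 if the class is present, else 0.
    """
    # Rank clips from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            # Precision measured at the rank of each positive clip.
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_matrix, label_matrix):
    """mAP: the per-class AP averaged over classes (rows = clips, columns = classes)."""
    num_classes = len(score_matrix[0])
    aps = []
    for c in range(num_classes):
        class_scores = [row[c] for row in score_matrix]
        class_labels = [row[c] for row in label_matrix]
        aps.append(average_precision(class_scores, class_labels))
    return sum(aps) / num_classes


# Toy data (made up): 4 clips, 2 classes.
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6], [0.4, 0.3]]
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_map = mean_average_precision(toy_scores, toy_labels)
print(round(toy_map, 3))  # 0.875
```

In this toy case the first class has AP 0.75 (one positive is ranked last) and the second has AP 1.0, giving an mAP of 0.875.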

Abstract

The massive growth of self-supervised learning (SSL) has been witnessed in language, vision, speech, and audio domains over the past few years. While discrete label prediction is widely adopted for other modalities, the state-of-the-art audio SSL models still employ reconstruction loss for pre-training. Compared with reconstruction loss, semantic-rich discrete label prediction encourages the SSL model to abstract the high-level audio semantics and discard the redundant details as in human perception. However, a semantic-rich acoustic tokenizer for general audio pre-training is usually not straightforward to obtain, due to the continuous property of audio and unavailable phoneme sequences like speech. To tackle this challenge, we propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representation from Audio Transformers, where an acoustic tokenizer and…
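
The contrast the abstract draws between reconstruction loss and discrete label prediction can be made concrete with a small sketch. The code below is an assumption-laden illustration, not the BEATs implementation: the nearest-neighbour codebook tokenizer is a simple stand-in for an acoustic tokenizer, and `tokenize`, `masked_label_loss`, the codebook, and the patch values are all hypothetical. It shows only the general shape of the objective: patches are mapped to discrete labels, and the model is scored with cross-entropy on the masked positions.

```python
import math


def tokenize(patches, codebook):
    """Stand-in tokenizer (assumption): label each patch with its nearest codebook index."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(patch, codebook[k]))
        for patch in patches
    ]


def masked_label_loss(logits, labels, mask):
    """Mean cross-entropy between predicted logits and discrete labels, masked positions only."""
    total, count = 0.0, 0
    for row, label, is_masked in zip(logits, labels, mask):
        if not is_masked:
            continue  # unmasked patches are visible input, not prediction targets
        m = max(row)
        # Numerically stable log-sum-exp for the softmax normaliser.
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0


# Toy data (made up): 4 two-dimensional "patch features" and a 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
patches = [[0.1, -0.1], [0.9, 1.2], [-1.1, 0.8], [0.2, 0.1]]
labels = tokenize(patches, codebook)  # [0, 1, 2, 0]
mask = [True, False, True, False]     # positions the model must predict

# An untrained model that outputs uniform logits scores log(3) on a 3-way vocabulary.
uniform_logits = [[0.0, 0.0, 0.0] for _ in patches]
loss = masked_label_loss(uniform_logits, labels, mask)
print(labels, round(loss, 4))  # [0, 1, 2, 0] 1.0986
```

A reconstruction objective would instead compare the model's output with the raw patch values, for example with a mean squared error, which rewards reproducing low-level detail rather than predicting an abstract label.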

Code & Models

Videos

BEATs: Audio Pre-Training with Acoustic Tokenizers · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Linear Layer · Softmax · Multi-Head Attention · Dense Connections · Attention Is All You Need · Residual Connection · Layer Normalization · Vision Transformer