EAT: Self-Supervised Pre-Training with Efficient Audio Transformer

Wenxi Chen; Yuzhe Liang; Ziyang Ma; Zhisheng Zheng; Xie Chen

arXiv:2401.03497 · eess.AS · January 9, 2024 · 1 citation

TL;DR

This paper introduces EAT, an efficient self-supervised audio transformer that improves audio representation learning while significantly reducing pre-training time. It builds on recent SSL methods from the image and audio domains (data2vec 2.0 and Audio-MAE).

Contribution

EAT employs a novel bootstrap self-supervised paradigm with a new Utterance-Frame Objective and optimized masking strategies to enhance audio SSL effectiveness and efficiency.
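
The page does not spell out the Utterance-Frame Objective, so the following is only a rough sketch of one plausible reading of the name: a frame-level regression term on masked patches plus a weighted utterance-level term on a pooled target. The function names, the `lam` weight, and the choice of mean-pooled teacher features as the utterance target are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import fmean


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return fmean((x - y) ** 2 for x, y in zip(a, b))


def ufo_loss(student_frames, student_utterance, teacher_frames, masked_idx, lam=1.0):
    """Hypothetical utterance-frame loss: frame term + lam * utterance term.

    student_frames, teacher_frames: lists of per-patch feature vectors.
    student_utterance: one pooled vector predicted by the student.
    masked_idx: positions of masked patches; the frame term uses only these.
    lam: weight of the utterance term (an assumed hyperparameter).
    """
    # Frame-level term: regress the teacher features at masked positions.
    frame = fmean(mse(student_frames[i], teacher_frames[i]) for i in masked_idx)

    # Utterance-level term: regress the mean of all teacher patch features.
    dim = len(teacher_frames[0])
    target = [fmean(f[d] for f in teacher_frames) for d in range(dim)]
    utterance = mse(student_utterance, target)

    return frame + lam * utterance


teacher = [[0.0, 0.0], [2.0, 2.0]]
student = [[1.0, 1.0], [2.0, 2.0]]
loss = ufo_loss(student, [0.0, 0.0], teacher, masked_idx=[0], lam=0.5)
```

In a bootstrap setup the teacher features would come from a slowly updated copy of the student; here they are plain lists so the arithmetic stays easy to follow.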

Findings

01 Achieves state-of-the-art results on multiple audio benchmarks.

02 Speeds up pre-training by up to 15 times.

03 Shows that large inverse block masks yield superior audio representations.

Abstract

Audio self-supervised learning (SSL) pre-training, which aims to learn good representations from unlabeled audio, has made remarkable progress. However, the extensive computational demands during pre-training pose a significant barrier to the potential application and optimization of audio SSL models. In this paper, inspired by the success of data2vec 2.0 in the image modality and Audio-MAE in the audio modality, we introduce Efficient Audio Transformer (EAT) to further improve the effectiveness and efficiency of audio SSL. The proposed EAT adapts the bootstrap self-supervised training paradigm to the audio domain. A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling capability of acoustic events. Furthermore, we reveal that the masking strategy is critical in audio SSL pre-training, and superior audio representations can be obtained with large inverse block masks.…
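
As a minimal illustration of the inverse block masking idea mentioned in the abstract, the sketch below starts from a fully masked grid of spectrogram patches and carves out square blocks that stay visible, so everything outside those blocks is masked. The function name, the grid size, the 5x5 block size, and the 20% visible ratio are illustrative assumptions rather than values taken from this page.

```python
import random


def inverse_block_mask(rows, cols, block=5, keep_ratio=0.2, seed=None):
    """Return a rows x cols grid of booleans where True means masked.

    The grid starts fully masked. Square blocks of visible patches are
    carved out until at least keep_ratio of the grid is visible.
    """
    rng = random.Random(seed)
    masked = [[True] * cols for _ in range(rows)]
    # Clamp the target so the loop below always terminates.
    target_visible = min(rows * cols, max(1, int(rows * cols * keep_ratio)))
    visible = 0
    while visible < target_visible:
        # Pick the top-left corner of a block that fits inside the grid.
        top = rng.randrange(0, max(1, rows - block + 1))
        left = rng.randrange(0, max(1, cols - block + 1))
        for r in range(top, min(rows, top + block)):
            for c in range(left, min(cols, left + block)):
                if masked[r][c]:
                    masked[r][c] = False
                    visible += 1
    return masked


grid = inverse_block_mask(rows=8, cols=64, block=5, keep_ratio=0.2, seed=0)
num_visible = sum(not cell for row in grid for cell in row)
```

Because the last block can overshoot the target, the visible count is a lower bound rather than an exact ratio; an encoder would then process only the visible patches, which is where the efficiency gain of a high mask ratio comes from.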

Code & Models

Repositories

cwx-worst-one/eat (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Linear Layer · Dropout · Adam · Layer Normalization · Residual Connection · Absolute Position Encodings · Dense Connections · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax