EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, Xie Chen

TL;DR
This paper introduces EAT, an efficient self-supervised audio transformer inspired by recent image and audio SSL methods, which improves audio representation learning while significantly reducing pre-training time.
Contribution
EAT brings the bootstrap self-supervised training paradigm to the audio domain, adding a novel Utterance-Frame Objective (UFO) and an optimized masking strategy to make audio SSL more effective and efficient.
Findings
Achieves state-of-the-art results on multiple audio benchmarks.
Speeds up pre-training by up to about 15x compared with existing audio SSL models.
Demonstrates superior representation quality with large inverse block masks.
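The inverse block masking mentioned above can be sketched roughly as follows. Instead of choosing patches to hide, the procedure chooses rectangular blocks of patches to keep visible and masks everything else. This is a minimal sketch, not the authors' implementation: the function name `inverse_block_mask`, the 64 x 8 patch grid, the 80% mask ratio, and the 5 x 5 block size are illustrative assumptions that the text above does not confirm.

```python
import numpy as np

def inverse_block_mask(grid_h, grid_w, mask_ratio=0.8, block=5, rng=None):
    """Return a boolean (grid_h, grid_w) array where True marks a masked patch.

    Hypothetical helper for illustration: visible patches are sampled as
    square blocks, and every patch outside those blocks is masked.
    """
    rng = np.random.default_rng() if rng is None else rng
    total = grid_h * grid_w
    n_keep = max(1, int(round(total * (1.0 - mask_ratio))))

    keep = np.zeros((grid_h, grid_w), dtype=bool)
    # Add randomly placed blocks of visible patches until enough are kept.
    while keep.sum() < n_keep:
        top = rng.integers(0, max(1, grid_h - block + 1))
        left = rng.integers(0, max(1, grid_w - block + 1))
        keep[top:top + block, left:left + block] = True

    # Blocks may overshoot the target, so hide a random surplus of kept patches.
    surplus = int(keep.sum()) - n_keep
    if surplus > 0:
        kept_idx = np.flatnonzero(keep)
        drop = rng.choice(kept_idx, size=surplus, replace=False)
        keep.reshape(-1)[drop] = False  # reshape returns a view here

    return ~keep

# Assumed grid: 64 time patches by 8 frequency patches.
mask = inverse_block_mask(64, 8, mask_ratio=0.8, block=5,
                          rng=np.random.default_rng(0))
print(mask.shape, int(mask.sum()))
```

Because the visible patches are clustered into blocks, the masked region is large and contiguous, which is what "large inverse block masks" refers to in this sketch.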
Abstract
Audio self-supervised learning (SSL) pre-training, which aims to learn good representations from unlabeled audio, has made remarkable progress. However, the extensive computational demands during pre-training pose a significant barrier to the potential application and optimization of audio SSL models. In this paper, inspired by the success of data2vec 2.0 in image modality and Audio-MAE in audio modality, we introduce Efficient Audio Transformer (EAT) to further improve the effectiveness and efficiency in audio SSL. The proposed EAT adopts the bootstrap self-supervised training paradigm to the audio domain. A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling capability of acoustic events. Furthermore, we reveal that the masking strategy is critical in audio SSL pre-training, and superior audio representations can be obtained with large inverse block masks.…
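One plausible reading of the Utterance-Frame Objective is a sum of two regression terms against a teacher network: a frame-level term on the masked patches and an utterance-level term on a single global vector. The sketch below assumes that reading. The function name `ufo_loss`, the mean-squared-error form, the mean-pooled teacher target, and the weight `lam` are assumptions made for illustration and are not taken from the text above.

```python
import numpy as np

def ufo_loss(student_frames, student_utt, teacher_frames, mask, lam=1.0):
    """Hypothetical utterance-plus-frame loss for a student/teacher setup.

    student_frames: (N, D) student predictions, one row per patch
    student_utt:    (D,)   student global (utterance-level) prediction
    teacher_frames: (N, D) teacher targets, one row per patch
    mask:           (N,)   bool, True where the patch was masked for the student
    lam:            weight on the utterance-level term (assumed)
    """
    # Frame-level term: regress teacher targets at masked positions only.
    diff = student_frames[mask] - teacher_frames[mask]
    frame_loss = np.mean(diff ** 2)

    # Utterance-level term: regress the mean-pooled teacher target.
    utt_target = teacher_frames.mean(axis=0)
    utt_loss = np.mean((student_utt - utt_target) ** 2)

    return frame_loss + lam * utt_loss

# Toy example: 4 patches with 2-dimensional features.
teacher = np.ones((4, 2))
student = np.zeros((4, 2))
mask = np.array([True, True, False, False])
print(ufo_loss(student, np.ones(2), teacher, mask))
```

In a bootstrap setup the teacher is typically a slowly updated copy of the student, so the targets come from the model itself rather than from labels.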
Code & Models
- 🤗 saurabhati/DASS_small_AudioSet_48.6 · model · 10 dl
- 🤗 saurabhati/DASS_medium_AudioSet_48.9 · model
- 🤗 saurabhati/DASS_small_AudioSet_50.1 · model · 45 dl
- 🤗 saurabhati/DASS_medium_AudioSet_50.2 · model · 53 dl · ♡ 2
- 🤗 worstchan/EAT-base_epoch30_finetune_AS2M · model · 37k dl · ♡ 2
- 🤗 worstchan/EAT-base_epoch30_pretrain · model · 1.1k dl · ♡ 5
- 🤗 worstchan/EAT-large_epoch20_finetune_AS2M · model · 515 dl · ♡ 3
- 🤗 worstchan/EAT-large_epoch20_pretrain · model · 230 dl
- 🤗 ta012/SSLAM_pretrain · model · 758 dl
- 🤗 ta012/SSLAM_AS2M_Finetuned · model · 328 dl
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Linear Layer · Dropout · Adam · Layer Normalization · Residual Connection · Absolute Position Encodings · Dense Connections · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax
