Scaling up masked audio encoder learning for general audio classification

Heinrich Dinkel; Zhiyong Yan; Yongqing Wang; Junbo Zhang; Yujun Wang; Bin Wang

arXiv:2406.06992·cs.SD·June 14, 2024

TL;DR

This paper introduces Dasheng, a large-scale self-supervised audio encoder trained on diverse data, significantly improving general audio classification across speech, music, and environmental sounds.

Contribution

The paper presents Dasheng, a self-supervised (SSL) audio encoder scaled up to 1.2 billion parameters and trained on 272,356 hours of diverse audio, which obtains significant performance gains on the HEAR benchmark.

Findings

01

Dasheng outperforms previous work on the HEAR benchmark tasks CREMA-D, LibriCount, Speech Commands, and VoxLingua.

02

It also competes well in music and environmental sound classification, indicating generalization beyond speech.

03

Nearest-neighbor analysis shows rich, multi-domain audio representations.
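
A nearest-neighbor analysis of this kind can be illustrated with a minimal sketch. It assumes each audio clip has already been reduced to one fixed-size embedding vector and that similarity is measured with cosine similarity; the function names, the toy vectors, and the choice of metric are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, labeled_embeddings, k=1):
    """Return the labels of the k stored embeddings most similar to query.

    labeled_embeddings is a list of (label, vector) pairs (hypothetical layout).
    """
    scored = [(cosine_similarity(query, vec), label)
              for label, vec in labeled_embeddings]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:k]]


# Toy 2-dimensional "embeddings", made up purely for illustration.
bank = [
    ("speech", [1.0, 0.0]),
    ("music", [0.0, 1.0]),
    ("environment", [-1.0, 0.0]),
]
neighbors = nearest_neighbors([0.9, 0.1], bank, k=2)
```

If clips from the same domain tend to retrieve each other as neighbors, the embedding space separates those domains without any fine-tuning.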

Abstract

Despite progress in audio classification, a generalization gap remains between speech and other sound domains, such as environmental sounds and music. Models trained for speech tasks often fail to perform well on environmental or musical audio tasks, and vice versa. While self-supervised (SSL) audio representations offer an alternative, there has been limited exploration of scaling both model and dataset sizes for SSL-based general audio classification. We introduce Dasheng, a simple SSL audio encoder, based on the efficient masked autoencoder framework. Trained with 1.2 billion parameters on 272,356 hours of diverse audio, Dasheng obtains significant performance gains on the HEAR benchmark. It outperforms previous works on CREMA-D, LibriCount, Speech Commands, VoxLingua, and competes well in music and environment classification. Dasheng features inherently contain rich speech, music,…
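
The masked autoencoder idea the abstract refers to can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the spectrogram is assumed to be already split into a sequence of patches, the 75% mask ratio is an assumed value commonly used with masked autoencoders, and `random_mask` and `masked_mse` are hypothetical helper names.

```python
import random


def random_mask(num_patches, mask_ratio, seed=None):
    """Randomly split patch indices into (visible, masked) lists.

    Only the visible patches would be passed to the encoder, which is
    what makes the masked autoencoder setup efficient.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


def masked_mse(target, prediction, masked):
    """Mean squared reconstruction error over the masked patches only."""
    total, count = 0.0, 0
    for i in masked:
        for t, p in zip(target[i], prediction[i]):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0


# 100 patches with an assumed mask ratio of 0.75.
visible, masked = random_mask(100, 0.75, seed=0)

# Toy reconstruction check on two 2-value patches, with patch 1 masked.
loss = masked_mse([[1.0, 2.0], [3.0, 4.0]],
                  [[1.0, 2.0], [0.0, 0.0]],
                  masked=[1])
```

Because the loss is taken over masked patches only, the model is rewarded for inferring hidden audio content from context rather than for copying its input.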

Code & Models

Repositories

Models

🤗 richermans/README.md (model)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing