OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder

Shikhar Bharadwaj; Samuele Cornell; Kwanghee Choi; Satoru Fukayama; Hye-jin Shim; Soham Deshmukh; Shinji Watanabe

arXiv:2507.14129·cs.SD·July 21, 2025

Open Access · 10 Models · 1 Dataset

TL;DR

OpenBEATs is an open-source, multi-domain audio pre-training framework based on masked token prediction. It achieves state-of-the-art results across diverse audio understanding tasks and datasets.

Contribution

OpenBEATs extends BEATs with multi-domain pre-training and open-source code, enabling broader application and reproducibility in general audio understanding.

Findings

01. State-of-the-art performance on bioacoustics and environmental sound datasets.

02. Multi-domain pre-training improves general audio representations.

03. OpenBEATs models outperform much larger models while using a fraction of their parameters.

Abstract

Masked token prediction has emerged as a powerful pre-training objective across language, vision, and speech, offering the potential to unify these diverse modalities through a single pre-training task. However, its application to general audio understanding remains underexplored, with BEATs being the only notable example. BEATs has seen limited modifications due to the absence of open-source pre-training code. Furthermore, BEATs was trained only on AudioSet, restricting its broader downstream applicability. To address these gaps, we present OpenBEATs, an open-source framework that extends BEATs via multi-domain audio pre-training. We conduct comprehensive evaluations across six types of tasks, twenty-five datasets, and three audio domains, including audio reasoning tasks such as audio question answering, entailment, and captioning. OpenBEATs achieves state-of-the-art performance on…

Code & Models

Datasets

Bencr/beats-checkpoints
dataset · 252 dl

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis