AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes

Zhaohui Li; Haitao Wang; Xinghua Jiang

arXiv:2308.07221 · cs.SD · August 28, 2023

Open Access

TL;DR

AudioFormer introduces a novel approach to audio classification by learning audio feature representations from discrete acoustic codes using a masked language model and contrastive learning, outperforming existing models on multiple datasets.

Contribution

The paper presents a new method that leverages discrete acoustic codes and contrastive learning to improve audio feature representation and classification performance.

Findings

01 Achieves state-of-the-art results on AudioSet and FSD50K datasets.

02 Outperforms existing monomodal and multimodal audio classification models.

03 Introduces a novel integration of discrete acoustic codes with MLM and MPC learning.

Abstract

We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes and subsequently fine-tunes them for audio classification tasks. Initially, we introduce a novel perspective by considering the audio classification task as a form of natural language understanding (NLU). Leveraging an existing neural audio codec model, we generate discrete acoustic codes and utilize them to train a masked language model (MLM), thereby obtaining audio feature representations. Furthermore, we pioneer the integration of a Multi-Positive sample Contrastive (MPC) learning approach. This method enables the learning of joint representations among multiple discrete acoustic codes within the same audio input. In our experiments, we treat discrete acoustic codes as textual data and train a masked language model using a cloze-like methodology, ultimately…
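
The cloze-like idea described in the abstract, treating discrete acoustic codes as text tokens and hiding some of them for the model to recover, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_codes`, the mask id `1024`, the masking probability, and the ignore label `-100` are all assumptions chosen for illustration, and the example code sequence is made up rather than produced by a real neural audio codec.

```python
import random

# Hypothetical values, not taken from the paper.
MASK_ID = 1024       # assumed id reserved for the [MASK] token
IGNORE_LABEL = -100  # assumed marker for positions that are not predicted


def mask_codes(codes, mask_prob=0.15, mask_id=MASK_ID, seed=None):
    """Randomly replace acoustic code tokens with a mask token.

    Returns (masked, labels): `masked` is the corrupted input sequence,
    `labels` holds the original code at masked positions and
    IGNORE_LABEL everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for code in codes:
        if rng.random() < mask_prob:
            masked.append(mask_id)   # hide the token from the model
            labels.append(code)      # the model should recover this code
        else:
            masked.append(code)      # token stays visible
            labels.append(IGNORE_LABEL)
    return masked, labels


# Made-up code sequence standing in for one codec output stream.
codes = [17, 903, 256, 42, 511, 8, 640, 77]
masked, labels = mask_codes(codes, mask_prob=0.5, seed=0)
```

A masked language model would then be trained to predict the entries of `labels` that are not `IGNORE_LABEL` from the `masked` sequence; the model and training loop are omitted here because the truncated abstract does not describe them.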

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis