AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes
Zhaohui Li, Haitao Wang, Xinghua Jiang

TL;DR
AudioFormer introduces a novel approach to audio classification by learning audio feature representations from discrete acoustic codes using a masked language model and contrastive learning, outperforming existing models on multiple datasets.
Contribution
The paper presents a new method that leverages discrete acoustic codes and contrastive learning to improve audio feature representation and classification performance.
Findings
Achieves state-of-the-art results on AudioSet and FSD50K datasets.
Outperforms existing monomodal and multimodal audio classification models.
Introduces a novel integration of discrete acoustic codes with MLM and MPC learning.
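To make the MPC idea in the findings above more concrete, here is a minimal sketch of a generic multi-positive contrastive (InfoNCE-style) loss. It is not taken from the paper: the function name, the temperature value, and the choice to mark positives by a shared `group_ids` entry (standing in for "codes from the same audio input") are all assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def multi_positive_contrastive_loss(embeddings, group_ids, temperature=0.1):
    """Generic multi-positive InfoNCE loss (illustrative, not the paper's exact form).

    embeddings[i] is a vector; every other item with the same group_ids[i]
    is treated as a positive for item i, and all remaining items as negatives.
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and group_ids[j] == group_ids[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Scaled cosine similarities to every other item.
        logits = {j: dot(z[i], z[j]) / temperature for j in range(n) if j != i}
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits.values()))
        # Average the negative log-probability over all positives of this anchor.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0

# Toy data: items 0 and 1 point one way, items 2 and 3 another.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matched = multi_positive_contrastive_loss(vectors, [0, 0, 1, 1])
loss_mismatched = multi_positive_contrastive_loss(vectors, [0, 1, 0, 1])
```

With this toy data the loss is lower when the grouping agrees with the geometry of the vectors than when it does not, which is the behavior a contrastive objective is meant to reward.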
Abstract
We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes and subsequently fine-tunes them for audio classification tasks. Initially, we introduce a novel perspective by considering the audio classification task as a form of natural language understanding (NLU). Leveraging an existing neural audio codec model, we generate discrete acoustic codes and utilize them to train a masked language model (MLM), thereby obtaining audio feature representations. Furthermore, we pioneer the integration of a Multi-Positive sample Contrastive (MPC) learning approach. This method enables the learning of joint representations among multiple discrete acoustic codes within the same audio input. In our experiments, we treat discrete acoustic codes as textual data and train a masked language model using a cloze-like methodology, ultimately…
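The cloze-like treatment of acoustic codes described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the codebook size implied by `MASK_ID`, the masking probability, the `IGNORE` label value, and the function name `mask_codes` are hypothetical and do not come from the paper, and the codes themselves would in practice come from a neural audio codec rather than a hand-written list.

```python
import random

MASK_ID = 1024  # hypothetical: one id past an assumed 1024-entry codebook
IGNORE = -100   # label for positions the model is not asked to predict

def mask_codes(codes, mask_prob=0.15, seed=None):
    """Build (inputs, labels) for cloze-style training on one code sequence.

    Each code is replaced by MASK_ID with probability mask_prob; the label
    keeps the original code at masked positions and IGNORE everywhere else.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for code in codes:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)
            labels.append(code)
        else:
            inputs.append(code)
            labels.append(IGNORE)
    return inputs, labels

# Stand-in for the discrete codes a codec would emit for a short clip.
codes = [17, 903, 42, 42, 511, 8, 640, 77]
inputs, labels = mask_codes(codes, mask_prob=0.5, seed=0)
```

A masked language model would then be trained to recover the original code at every position where `labels` is not `IGNORE`, exactly as a text MLM recovers masked words.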
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
