MusiCoder: A Universal Music-Acoustic Encoder Based on Transformers
Yilun Zhao, Jia Guo

TL;DR
MusiCoder introduces a self-supervised transformer-based music acoustic encoder that improves music annotation tasks by leveraging masked reconstruction pre-training on unlabeled data.
Contribution
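The paper proposes a self-supervised method for learning music acoustic representations with transformers. It introduces two masking objectives, Contiguous Frames Masking (CFM) and Contiguous Channels Masking (CCM), and reports outperforming existing models on downstream music annotation tasks.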
Findings
Outperforms state-of-the-art models in genre classification
Outperforms existing models in auto-tagging
Demonstrates effectiveness of self-supervised pre-training for music understanding
Abstract
Music annotation has always been one of the critical topics in the field of Music Information Retrieval (MIR). Traditional models use supervised learning for music annotation tasks. However, as supervised machine learning approaches grow in complexity, their need for annotated training data often cannot be met by the data available. In this paper, a new self-supervised music acoustic representation learning approach named MusiCoder is proposed. Inspired by the success of BERT, MusiCoder builds upon the architecture of self-attention bidirectional transformers. Two pre-training objectives, Contiguous Frames Masking (CFM) and Contiguous Channels Masking (CCM), are designed to adapt BERT-like masked reconstruction pre-training to the continuous acoustic frame domain. The performance of MusiCoder is evaluated in two downstream music annotation tasks. The results…
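The two masking objectives named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the span and width parameters, the zero-fill of masked regions, and the L1 reconstruction loss are all assumptions made for illustration. It only assumes the input is a 2-D array of acoustic features with shape (frames, channels).

```python
import numpy as np

def mask_contiguous_frames(features, span, rng):
    """CFM-style masking (assumed form): zero `span` consecutive frames (rows)."""
    num_frames = features.shape[0]
    # Upper bound is exclusive, so the span always fits inside the array.
    start = int(rng.integers(0, num_frames - span + 1))
    masked = features.copy()
    mask = np.zeros(features.shape, dtype=bool)
    masked[start:start + span, :] = 0.0
    mask[start:start + span, :] = True
    return masked, mask

def mask_contiguous_channels(features, width, rng):
    """CCM-style masking (assumed form): zero `width` consecutive channels (columns)."""
    num_channels = features.shape[1]
    start = int(rng.integers(0, num_channels - width + 1))
    masked = features.copy()
    mask = np.zeros(features.shape, dtype=bool)
    masked[:, start:start + width] = 0.0
    mask[:, start:start + width] = True
    return masked, mask

def masked_l1_loss(prediction, target, mask):
    """Mean absolute error over masked positions only (the loss choice is an assumption)."""
    return float(np.abs(prediction - target)[mask].mean())

# Hypothetical input: 100 frames of 80-channel features.
rng = np.random.default_rng(0)
features = rng.standard_normal((100, 80))
masked_frames, frame_mask = mask_contiguous_frames(features, span=10, rng=rng)
masked_channels, channel_mask = mask_contiguous_channels(features, width=8, rng=rng)
```

During pre-training, an encoder would receive the masked array and be trained to reconstruct the original values at the masked positions. Restricting the loss to those positions keeps the model from being rewarded for simply copying its unmasked input.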
Taxonomy
Methods: Linear Layer · Weight Decay · WordPiece · Softmax · Dense Connections · Attention Is All You Need · Linear Warmup With Linear Decay · Dropout · Layer Normalization · Residual Connection
