YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation
Sungkyun Chang, Emmanouil Benetos, Holger Kirchhoff, Simon Dixon

TL;DR
YourMT3+ is a suite of multi-instrument music transcription models that extends MT3 with a hierarchical time-frequency attention encoder, a mixture of experts, and intra- and cross-stem data augmentation. It reports competitive or superior results on ten public datasets and transcribes vocals directly, without a voice separation step.
Contribution
The paper presents YourMT3+, a family of multi-instrument transcription models that combines a hierarchical attention encoder, a mixture of experts, a multi-channel decoding method for training on incompletely annotated data, and intra- and cross-stem augmentation for mixing datasets.
Findings
Achieves competitive or superior results on ten public datasets.
Enables direct vocal transcription without voice separation.
Exposes remaining limitations when tested on pop music recordings.
Abstract
Multi-instrument music transcription aims to convert polyphonic music recordings into musical scores assigned to each instrument. This task is challenging for modeling as it requires simultaneously identifying multiple instruments and transcribing their pitch and precise timing, and the lack of fully annotated data adds to the training difficulties. This paper introduces YourMT3+, a suite of models for enhanced multi-instrument music transcription based on the recent language token decoding approach of MT3. We enhance its encoder by adopting a hierarchical attention transformer in the time-frequency domain and integrating a mixture of experts. To address data limitations, we introduce a new multi-channel decoding method for training with incomplete annotations and propose intra- and cross-stem augmentation for dataset mixing. Our experiments demonstrate direct vocal transcription…
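The cross-stem augmentation mentioned in the abstract can be pictured roughly as follows. This is a minimal sketch and not the authors' implementation: the `Stem` container, the note tuple layout, the `cross_stem_mix` function, and the peak normalization step are all assumptions made for illustration. It only shows the general idea of summing isolated instrument tracks drawn from different datasets and merging their note annotations into one training example.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical note format: (onset_sec, offset_sec, midi_pitch, midi_program)
Note = Tuple[float, float, int, int]


@dataclass
class Stem:
    """One isolated instrument track with its notes (hypothetical container)."""
    dataset: str          # name of the source dataset
    audio: List[float]    # mono samples; all stems assumed to share a sample rate
    notes: List[Note]


def cross_stem_mix(
    pool: List[Stem], max_stems: int, rng: random.Random
) -> Tuple[List[float], List[Note]]:
    """Sum randomly chosen stems into one mixture and merge their annotations."""
    # Pick between 1 and max_stems stems, regardless of which dataset they come from.
    k = rng.randint(1, min(max_stems, len(pool)))
    chosen = rng.sample(pool, k)

    # Truncate to the shortest stem so every sample has a value from each stem.
    length = min(len(s.audio) for s in chosen)
    mix = [sum(s.audio[i] for s in chosen) for i in range(length)]

    # Assumption: rescale only when the summed signal would clip.
    peak = max((abs(x) for x in mix), default=0.0)
    if peak > 1.0:
        mix = [x / peak for x in mix]

    # The target is the union of all chosen stems' notes, ordered by onset.
    notes = sorted(n for s in chosen for n in s.notes)
    return mix, notes
```

With a pool built from several datasets, repeated calls such as `cross_stem_mix(pool, 4, random.Random(0))` yield new instrument combinations that none of the source datasets contains on its own, which is the motivation the abstract gives for dataset mixing.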
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies · Music Technology and Sound Studies
Methods: Linear Layer · Root Mean Square Layer Normalization · Rotary Position Embedding · Dropout · Attention Is All You Need · Multi-Head Attention · T5 · Perceiver IO · Mixture of Experts
