YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation

Sungkyun Chang; Emmanouil Benetos; Holger Kirchhoff; Simon Dixon

arXiv:2407.04822 · eess.AS · August 2, 2024

TL;DR

YourMT3+ is a multi-instrument music transcription model that combines a hierarchical attention transformer encoder, a mixture of experts, and intra- and cross-stem data augmentation. It reports competitive or superior results on ten public datasets and transcribes vocals directly, without requiring voice separation.

Contribution

The paper presents YourMT3+, a suite of multi-instrument transcription models whose encoder uses hierarchical attention in the time-frequency domain together with a mixture of experts. A multi-channel decoding method allows training with incomplete annotations, and intra- and cross-stem augmentation supports dataset mixing.

Findings

01 Achieves competitive or superior results on ten public datasets.

02 Enables direct vocal transcription without voice separation.

03 Demonstrates limitations on pop music recordings.

Abstract

Multi-instrument music transcription aims to convert polyphonic music recordings into musical scores assigned to each instrument. This task is challenging for modeling as it requires simultaneously identifying multiple instruments and transcribing their pitch and precise timing, and the lack of fully annotated data adds to the training difficulties. This paper introduces YourMT3+, a suite of models for enhanced multi-instrument music transcription based on the recent language token decoding approach of MT3. We enhance its encoder by adopting a hierarchical attention transformer in the time-frequency domain and integrating a mixture of experts. To address data limitations, we introduce a new multi-channel decoding method for training with incomplete annotations and propose intra- and cross-stem augmentation for dataset mixing. Our experiments demonstrate direct vocal transcription…
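
The stem-mixing idea behind the augmentation mentioned in the abstract can be illustrated with a small sketch. This page does not describe the paper's actual procedure, so the code below is only a generic illustration under stated assumptions: isolated stems (possibly drawn from different datasets) are pre-cut to the same length, a random subset is summed into one mixture, and their note annotations are merged. The names `Stem`, `Note`, and `mix_stems`, and the `max_stems` parameter, are hypothetical and not taken from the paper or its repository.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# A note event: (onset_sec, offset_sec, midi_pitch, instrument_name).
Note = Tuple[float, float, int, str]


@dataclass
class Stem:
    """Hypothetical container: one isolated instrument track plus its notes."""
    audio: List[float]  # mono samples, assumed to lie in [-1, 1]
    notes: List[Note]


def mix_stems(stems: List[Stem], rng: random.Random, max_stems: int = 4) -> Stem:
    """Sum a random subset of stems into one training example (illustrative only)."""
    # Pick how many stems to combine, then choose them without replacement.
    k = rng.randint(1, min(max_stems, len(stems)))
    chosen = rng.sample(stems, k)

    # Assumption: segments were already cut to a common length.
    if len({len(s.audio) for s in chosen}) != 1:
        raise ValueError("all stems must have the same number of samples")

    # Sample-wise sum of the chosen stems.
    length = len(chosen[0].audio)
    mixed = [sum(s.audio[i] for s in chosen) for i in range(length)]

    # Rescale only if the sum would clip.
    peak = max((abs(x) for x in mixed), default=0.0)
    if peak > 1.0:
        mixed = [x / peak for x in mixed]

    # The target annotation is the union of the chosen stems' notes, by onset.
    notes = sorted(n for s in chosen for n in s.notes)
    return Stem(audio=mixed, notes=notes)
```

A caller would build a pool of `Stem` objects from several datasets and call `mix_stems(pool, random.Random(seed))` once per training example. A real pipeline would likely operate on arrays and also handle resampling and segment alignment, which this sketch leaves out.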

Code & Models

Repositories

mimbres/yourmt3 (official)

Models

mimbres/YourMT3 (Hugging Face model)

Datasets

Richhiey/YourMT3 (dataset, 151 downloads)

Taxonomy

Topics: Music and Audio Processing · Diverse Musicological Studies · Music Technology and Sound Studies

Methods: Linear Layer · Root Mean Square Layer Normalization · Rotary Position Embedding · Dropout · Attention Is All You Need · Multi-Head Attention · T5 · Perceiver IO · Mixture of Experts