Exploring Transformer's potential on automatic piano transcription

Longshen Ou; Ziyi Guo; Emmanouil Benetos; Jiqing Han; Ye Wang

arXiv:2204.03898 · eess.AS · April 11, 2022

TL;DR

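This paper investigates the Transformer for automatic piano transcription. It reports better velocity detection than CNN-RNN models, along with gains on frame-level and note-level metrics, based on experiments on the MAESTRO dataset and a cross-dataset evaluation on the MAPS dataset.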

Contribution

It incorporates the Transformer into a high-resolution piano transcription system for automatic music transcription (AMT), argues that the Transformer's properties suit certain AMT subtasks, and shows its advantage on velocity detection.

Findings

01 The Transformer outperforms CNN-RNN models on velocity detection

02 Performance improves on frame-level metrics

03 Performance improves on note-level metrics

Abstract

Most recent research about automatic music transcription (AMT) uses convolutional neural networks and recurrent neural networks to model the mapping from music signals to symbolic notation. Based on a high-resolution piano transcription system, we explore the possibility of incorporating another powerful sequence transformation tool -- the Transformer -- to deal with the AMT problem. We argue that the properties of the Transformer make it more suitable for certain AMT subtasks. We confirm the Transformer's superiority on the velocity detection task by experiments on the MAESTRO dataset and a cross-dataset evaluation on the MAPS dataset. We observe a performance improvement on both frame-level and note-level metrics after introducing the Transformer network.
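To make the frame-level metrics mentioned in the abstract concrete, here is a minimal sketch of how frame-level precision, recall, and F1 are commonly computed from binary piano rolls. It is not the paper's evaluation code: the function name `frame_level_scores`, the `PianoRoll` type, and the toy data are assumptions made for illustration, and the sketch ignores details such as frame rate and thresholding of model outputs.

```python
from typing import List, Tuple

# Hypothetical representation: roll[frame][pitch] is 1 if the pitch sounds in that frame.
PianoRoll = List[List[int]]


def frame_level_scores(reference: PianoRoll, estimate: PianoRoll) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over all frame/pitch cells of two binary piano rolls."""
    tp = fp = fn = 0
    for ref_frame, est_frame in zip(reference, estimate):
        for ref_on, est_on in zip(ref_frame, est_frame):
            if est_on and ref_on:
                tp += 1  # pitch correctly reported as active
            elif est_on:
                fp += 1  # pitch reported but not in the reference
            elif ref_on:
                fn += 1  # pitch in the reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: 4 frames, 3 pitches (assumed data, not taken from MAESTRO or MAPS).
reference = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 0]]
estimate = [[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 0, 0]]
precision, recall, f1 = frame_level_scores(reference, estimate)
print(precision, recall, f1)
```

Note-level metrics differ in that they match whole note events (onset, and optionally offset and velocity) within a tolerance rather than comparing individual frames, so they need an event-matching step that this sketch does not cover.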

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections · Dropout · Layer Normalization · Absolute Position Encodings · Softmax