Exploring Transformer's potential on automatic piano transcription
Longshen Ou, Ziyi Guo, Emmanouil Benetos, Jiqing Han, Ye Wang

TL;DR
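This paper investigates incorporating the Transformer into a high-resolution automatic piano transcription system. It reports better velocity detection than CNN-RNN models on the MAESTRO dataset and in a cross-dataset evaluation on MAPS, along with gains on frame-level and note-level metrics.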
Contribution
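It incorporates the Transformer architecture into a high-resolution piano transcription system for automatic music transcription (AMT) and argues that the Transformer's properties make it more suitable for certain AMT subtasks, velocity detection in particular.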
Findings
Transformer outperforms CNN-RNN models on velocity detection
Performance improves on frame-level metrics
Performance improves on note-level metrics
Abstract
Most recent research about automatic music transcription (AMT) uses convolutional neural networks and recurrent neural networks to model the mapping from music signals to symbolic notation. Based on a high-resolution piano transcription system, we explore the possibility of incorporating another powerful sequence transformation tool -- the Transformer -- to deal with the AMT problem. We argue that the properties of the Transformer make it more suitable for certain AMT subtasks. We confirm the Transformer's superiority on the velocity detection task by experiments on the MAESTRO dataset and a cross-dataset evaluation on the MAPS dataset. We observe a performance improvement on both frame-level and note-level metrics after introducing the Transformer network.
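As a rough illustration of what a frame-level metric measures, the sketch below computes precision, recall, and F1 between two binary piano rolls (frames by pitches). It is a minimal toy version, not the paper's evaluation code: the function name `frame_metrics`, the list-of-lists piano-roll layout, and the example data are all assumptions made for illustration.

```python
def frame_metrics(reference, estimate):
    """Frame-level precision, recall, and F1 for two binary piano rolls.

    Each piano roll is a list of frames; each frame is a list of 0/1 flags,
    one per pitch (hypothetical layout chosen for this sketch).
    """
    if len(reference) != len(estimate):
        raise ValueError("piano rolls must have the same number of frames")

    tp = fp = fn = 0
    for ref_frame, est_frame in zip(reference, estimate):
        if len(ref_frame) != len(est_frame):
            raise ValueError("frames must cover the same number of pitches")
        for ref, est in zip(ref_frame, est_frame):
            if est and ref:
                tp += 1  # pitch correctly marked active in this frame
            elif est and not ref:
                fp += 1  # pitch marked active but silent in the reference
            elif ref and not est:
                fn += 1  # active pitch missed by the estimate

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy data: 3 frames, 4 pitches (made up for illustration).
reference = [[1, 0, 0, 0], [1, 1, 0, 0], [0, 1, 0, 0]]
estimate = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 1, 0]]
precision, recall, f1 = frame_metrics(reference, estimate)
print(precision, recall, f1)
```

Note-level metrics differ in that they match whole note events (onset, and optionally offset and velocity) within tolerances rather than counting per-frame activations, so they are not covered by this sketch.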
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections · Dropout · Layer Normalization · Absolute Position Encodings · Softmax
