Sequence-to-Sequence Piano Transcription with Transformers

Curtis Hawthorne; Ian Simon; Rigel Swavely; Ethan Manilow; Jesse Engel

arXiv:2107.09142·cs.SD·July 21, 2021·20 cites

TL;DR

This paper demonstrates that a generic encoder-decoder Transformer can effectively perform music transcription from spectrograms to MIDI-like outputs, simplifying the process and reducing the need for task-specific models.

Contribution

The paper introduces a sequence-to-sequence Transformer model for music transcription that, using only standard decoding methods, achieves performance comparable to specialized models.

Findings

01 Matches the accuracy of specialized state-of-the-art transcription models

02 Simplifies the architecture by removing task-specific design

03 Enables direct translation from spectrograms to MIDI-like events

Abstract

Automatic Music Transcription has seen significant progress in recent years by training custom deep neural networks on large datasets. However, these models have required extensive domain-specific design of network architectures, input/output representations, and complex decoding schemes. In this work, we show that equivalent performance can be achieved using a generic encoder-decoder Transformer with standard decoding methods. We demonstrate that the model can learn to translate spectrogram inputs directly to MIDI-like output events for several transcription tasks. This sequence-to-sequence approach simplifies transcription by jointly modeling audio features and language-like output dependencies, thus removing the need for task-specific architectures. These results point toward possibilities for creating new Music Information Retrieval models by focusing on dataset creation and…
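
The idea of "MIDI-like output events" can be made concrete with a small sketch of how a set of notes might be flattened into a token sequence for a decoder to predict. This is only an illustration under stated assumptions: the token names (`time_*`, `velocity_*`, `note_*`, `EOS`), the 10 ms time step, and the function `notes_to_events` are hypothetical and are not taken from the abstract above.

```python
from typing import List, Tuple

# (onset_sec, offset_sec, midi_pitch, velocity); a hypothetical note format
Note = Tuple[float, float, int, int]


def notes_to_events(notes: List[Note], time_step: float = 0.01) -> List[str]:
    """Flatten notes into a time-ordered list of MIDI-like event tokens.

    Assumed vocabulary (illustrative only):
      time_N      absolute time in units of `time_step`
      velocity_V  velocity applied to the following notes (0 means note-off)
      note_P      MIDI pitch being switched on or off
      EOS         end of sequence
    """
    # Each note contributes two state changes: an onset and an offset.
    changes = []
    for onset, offset, pitch, velocity in notes:
        changes.append((round(onset / time_step), pitch, velocity))
        changes.append((round(offset / time_step), pitch, 0))
    changes.sort()  # order by time step, then pitch

    events: List[str] = []
    current_time = None
    current_velocity = None
    for step, pitch, velocity in changes:
        # Emit a time token only when the time step changes.
        if step != current_time:
            events.append(f"time_{step}")
            current_time = step
        # Emit a velocity token only when the velocity changes.
        if velocity != current_velocity:
            events.append(f"velocity_{velocity}")
            current_velocity = velocity
        events.append(f"note_{pitch}")
    events.append("EOS")
    return events


if __name__ == "__main__":
    example = [(0.0, 0.5, 60, 80), (0.25, 0.5, 64, 80)]
    print(notes_to_events(example))
```

In a sequence-to-sequence setup, a token list of this kind would serve as the decoder target, with spectrogram frames as the encoder input; the exact vocabulary and time resolution are design choices.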

Taxonomy

Topics: Music and Audio Processing · Diverse Musicological Studies · Music Technology and Sound Studies

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Layer Normalization · Softmax · Dense Connections