Sequence-to-Sequence Piano Transcription with Transformers
Curtis Hawthorne, Ian Simon, Rigel Swavely, Ethan Manilow, Jesse Engel

TL;DR
This paper demonstrates that a generic encoder-decoder Transformer can effectively perform music transcription from spectrograms to MIDI-like outputs, simplifying the process and reducing the need for task-specific models.
Contribution
It introduces a sequence-to-sequence Transformer model for music transcription that achieves comparable performance to specialized models using standard decoding methods.
Findings
Transformer-based model matches the transcription accuracy of specialized, task-specific models
Simplifies architecture by removing task-specific design
Enables direct translation from spectrograms to MIDI-like events
Abstract
Automatic Music Transcription has seen significant progress in recent years by training custom deep neural networks on large datasets. However, these models have required extensive domain-specific design of network architectures, input/output representations, and complex decoding schemes. In this work, we show that equivalent performance can be achieved using a generic encoder-decoder Transformer with standard decoding methods. We demonstrate that the model can learn to translate spectrogram inputs directly to MIDI-like output events for several transcription tasks. This sequence-to-sequence approach simplifies transcription by jointly modeling audio features and language-like output dependencies, thus removing the need for task-specific architectures. These results point toward possibilities for creating new Music Information Retrieval models by focusing on dataset creation and…
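To make "MIDI-like output events" concrete, here is a minimal sketch of how a list of notes could be flattened into a token sequence for a sequence-to-sequence decoder, and read back again. The abstract does not specify the vocabulary, so the event names (`time_N`, `note_on_P`, `note_off_P`, `eos`), the 10 ms time step, and both helper functions are assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Tuple

# Assumed time resolution of one time token (hypothetical, not from the paper).
TIME_STEP_S = 0.01

# A note is (onset_seconds, offset_seconds, midi_pitch).
Note = Tuple[float, float, int]


def notes_to_events(notes: List[Note]) -> List[str]:
    """Flatten notes into a time-ordered list of MIDI-like event strings."""
    raw = []
    for onset, offset, pitch in notes:
        # The flag is 1 for onsets and 0 for offsets, so at an identical
        # time step the offsets sort ahead of the onsets.
        raw.append((round(onset / TIME_STEP_S), 1, pitch))
        raw.append((round(offset / TIME_STEP_S), 0, pitch))
    raw.sort()

    events: List[str] = []
    current_step = None
    for step, is_onset, pitch in raw:
        # Emit an absolute time token only when the time step changes.
        if step != current_step:
            events.append(f"time_{step}")
            current_step = step
        events.append(f"note_on_{pitch}" if is_onset else f"note_off_{pitch}")
    events.append("eos")
    return events


def events_to_notes(events: List[str]) -> List[Note]:
    """Invert notes_to_events: pair each note_on with the next note_off."""
    notes: List[Note] = []
    active: Dict[int, int] = {}  # pitch -> onset time step
    step = 0
    for event in events:
        if event == "eos":
            break
        kind, _, value = event.rpartition("_")
        if kind == "time":
            step = int(value)
        elif kind == "note_on":
            active[int(value)] = step
        elif kind == "note_off" and int(value) in active:
            onset_step = active.pop(int(value))
            notes.append((onset_step * TIME_STEP_S, step * TIME_STEP_S, int(value)))
    return sorted(notes)


example_notes = [(0.0, 0.5, 60), (0.25, 0.5, 64)]
example_events = notes_to_events(example_notes)
print(example_events)
```

In a setup like this, the decoder would be trained to emit such tokens one at a time, conditioned on the encoded spectrogram frames, and standard decoding (for example, greedy decoding) would stop at `eos`.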
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies · Music Technology and Sound Studies
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Layer Normalization · Softmax · Dense Connections
