Pay Attention to the Keys: Visual Piano Transcription Using Transformers
Uros Zivanovic, Ivan Pilkov, Carlos Eduardo Cancino-Chacón

TL;DR
This paper introduces a vision transformer-based system for visual piano transcription that outperforms previous CNN-based methods. The system is trained on a new dataset and, in addition to note onsets, predicts note offsets.
Contribution
The paper presents a VPT system based on the vision transformer (ViT), introduces the R3 dataset (ca. 31 hours of synchronized video and MIDI recordings), and proposes a method for predicting note offsets, which had not previously been explored for this task.
Findings
Outperforms the state of the art on the PianoYT dataset for onset prediction
Outperforms the state of the art on the R3 dataset for both onset and offset prediction
Introduces the R3 dataset and an approach to note offset prediction
Abstract
Visual piano transcription (VPT) is the task of obtaining a symbolic representation of a piano performance from visual information alone (e.g., from a top-down video of the piano keyboard). In this work we propose a VPT system based on the vision transformer (ViT), which surpasses previous methods based on convolutional neural networks (CNNs). Our system is trained on the newly introduced R3 dataset, consisting of ca. 31 hours of synchronized video and MIDI recordings of piano performances. We additionally introduce an approach to predict note offsets, which has not been previously explored in this context. We show that our system outperforms the state-of-the-art on the PianoYT dataset for onset prediction and on the R3 dataset for both onsets and offsets.
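To make the onset and offset terminology concrete, here is a minimal sketch of how per-frame, per-key onset and offset probabilities could be decoded into note events. It is not the authors' method: the abstract does not describe the system's output format or decoding step, so the function name `decode_notes`, the 0.5 threshold, the rising-edge onset rule, and the toy two-key input are all illustrative assumptions.

```python
from typing import List, Tuple

# A note event: (key index, onset frame, offset frame). Hypothetical format.
Note = Tuple[int, int, int]


def decode_notes(
    onsets: List[List[float]],
    offsets: List[List[float]],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[Note]:
    """Turn frame-level onset/offset probabilities into note events.

    onsets[t][k] and offsets[t][k] are the probabilities that key k is
    pressed or released at video frame t.
    """
    num_frames = len(onsets)
    num_keys = len(onsets[0]) if num_frames else 0
    notes: List[Note] = []
    for key in range(num_keys):
        start = None  # onset frame of the currently sounding note, if any
        for t in range(num_frames):
            above = onsets[t][key] >= threshold
            was_above = t > 0 and onsets[t - 1][key] >= threshold
            if above and not was_above:
                # Rising edge: a new onset. Close any note still open.
                if start is not None:
                    notes.append((key, start, t))
                start = t
            elif start is not None and offsets[t][key] >= threshold:
                notes.append((key, start, t))
                start = None
        if start is not None:
            # No offset detected: end the note at the last frame.
            notes.append((key, start, num_frames))
    return notes


# Toy example with 2 keys and 5 frames.
onset_probs = [[0.9, 0.0], [0.0, 0.0], [0.0, 0.8], [0.0, 0.0], [0.0, 0.0]]
offset_probs = [[0.0, 0.0], [0.0, 0.0], [0.7, 0.0], [0.0, 0.0], [0.0, 0.0]]
notes = decode_notes(onset_probs, offset_probs)
print(notes)  # [(0, 0, 2), (1, 2, 5)]
```

Without offset predictions, a decoder like this could only report when a key was struck; the second output is what makes note durations recoverable.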
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
Methods: Attention Is All You Need · Absolute Position Encodings · Label Smoothing · Adam · Residual Connection · Softmax · Linear Layer · Dropout · Layer Normalization · Multi-Head Attention
