Pay Attention to the Keys: Visual Piano Transcription Using Transformers

Uros Zivanovic; Ivan Pilkov; Carlos Eduardo Cancino-Chacón

arXiv:2411.09037·cs.CV·September 30, 2025

TL;DR

This paper introduces a vision transformer (ViT) based system for visual piano transcription that outperforms previous CNN-based methods. The system is trained on the new R3 dataset and also predicts note offsets.

Contribution

The paper presents a visual piano transcription (VPT) system based on the vision transformer (ViT), introduces the R3 dataset, and proposes an approach to predicting note offsets, which had not previously been explored for this task.

Findings

01 Outperforms previous CNN-based methods on the PianoYT dataset for onset prediction.

02 Outperforms the prior state of the art on the R3 dataset for both onset and offset prediction.

03 Introduces the R3 dataset and an approach to offset prediction.

Abstract

Visual piano transcription (VPT) is the task of obtaining a symbolic representation of a piano performance from visual information alone (e.g., from a top-down video of the piano keyboard). In this work we propose a VPT system based on the vision transformer (ViT), which surpasses previous methods based on convolutional neural networks (CNNs). Our system is trained on the newly introduced R3 dataset, consisting of ca. 31 hours of synchronized video and MIDI recordings of piano performances. We additionally introduce an approach to predict note offsets, which has not been previously explored in this context. We show that our system outperforms the state-of-the-art on the PianoYT dataset for onset prediction and on the R3 dataset for both onsets and offsets.
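
To make the notion of a "symbolic representation" with onsets and offsets concrete, the following is a minimal sketch of how per-frame, per-key onset and offset flags could be paired into note events. It is not the authors' method: the abstract does not describe their decoding step, and the function name `decode_notes`, the frame rate `fps`, and the 88-key layout starting at MIDI pitch 21 are all assumptions made for illustration.

```python
from typing import List, Tuple

# (MIDI pitch, onset in seconds, offset in seconds)
Note = Tuple[int, float, float]


def decode_notes(
    onsets: List[List[bool]],
    offsets: List[List[bool]],
    fps: float = 30.0,       # assumed video frame rate
    lowest_pitch: int = 21,  # assumed: key 0 is A0 (MIDI pitch 21)
) -> List[Note]:
    """Pair per-frame onset/offset flags into note events.

    onsets[t][k] and offsets[t][k] are flags for video frame t and key k.
    This is a hypothetical post-processing step, not the paper's.
    """
    n_frames = len(onsets)
    n_keys = len(onsets[0]) if n_frames else 0
    active = {}  # key index -> frame at which the sounding note started
    notes: List[Note] = []

    for t in range(n_frames):
        for k in range(n_keys):
            # An offset, or a re-struck key, ends the currently sounding note.
            if k in active and (offsets[t][k] or onsets[t][k]):
                start = active.pop(k)
                notes.append((lowest_pitch + k, start / fps, t / fps))
            if onsets[t][k]:
                active[k] = t

    # Notes still sounding when the clip ends are closed at the last frame.
    for k, start in active.items():
        notes.append((lowest_pitch + k, start / fps, n_frames / fps))

    return sorted(notes, key=lambda n: (n[1], n[0]))


if __name__ == "__main__":
    # Two keys, four frames at 2 frames per second.
    on = [[True, False], [False, True], [False, False], [False, False]]
    off = [[False, False], [False, False], [True, False], [False, False]]
    for pitch, start, end in decode_notes(on, off, fps=2.0):
        print(pitch, start, end)
```

Without offset predictions, only the onset times in such a list could be filled in from the model, which is why offset prediction matters for producing complete MIDI-like output.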

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing

Methods: Attention Is All You Need · Absolute Position Encodings · Label Smoothing · Adam · Residual Connection · Softmax · Linear Layer · Dropout · Layer Normalization · Multi-Head Attention