Multi-modal Conditional Bounding Box Regression for Music Score Following
Florian Henkel, Gerhard Widmer

TL;DR
This paper introduces a neural network approach, inspired by object detection, for real-time score following in sheet music images. It achieves state-of-the-art results on a synthetic piano benchmark and improves alignment on real piano recordings.
Contribution
A new conditional neural network architecture for on-line audio-to-score alignment that directly predicts score positions from sheet images, improving accuracy over existing methods.
Findings
Achieves state-of-the-art results on a synthetic polyphonic piano benchmark dataset.
Significantly improves alignment on real-world piano recordings by using impulse responses as data augmentation.
Outperforms existing score following approaches and OMR baselines.
Abstract
This paper addresses the problem of sheet-image-based on-line audio-to-score alignment, also known as score following. Drawing inspiration from object detection, a conditional neural network architecture is proposed that directly predicts the x, y coordinates of the matching positions in a complete score sheet image at each point in time for a given musical performance. Experiments are conducted on a synthetic polyphonic piano benchmark dataset, and the new method is compared to several existing approaches from the literature for sheet-image-based score following as well as an Optical Music Recognition baseline. The proposed approach achieves new state-of-the-art results and, furthermore, significantly improves the alignment performance on a set of real-world piano recordings by applying impulse responses as a data augmentation technique.
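The impulse-response augmentation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `apply_impulse_response`, the peak normalization step, and the synthetic decaying-noise impulse response are all assumptions made for illustration. The general idea is to convolve a clean (for example, synthesized) signal with a room or microphone impulse response so that training audio sounds closer to a real recording.

```python
import numpy as np


def apply_impulse_response(audio, impulse_response):
    """Convolve a mono signal with an impulse response (hypothetical helper).

    The output is trimmed to the input length and rescaled to the input's
    peak amplitude; both choices are assumptions of this sketch.
    """
    audio = np.asarray(audio, dtype=np.float64)
    ir = np.asarray(impulse_response, dtype=np.float64)

    # Full linear convolution, then keep only the first len(audio) samples.
    wet = np.convolve(audio, ir, mode="full")[: len(audio)]

    # Rescale so the augmented signal has the same peak as the dry signal.
    peak_in = np.max(np.abs(audio))
    peak_out = np.max(np.abs(wet))
    if peak_out > 0:
        wet = wet * (peak_in / peak_out)
    return wet


# Toy data: a one-second 440 Hz tone stands in for synthesized piano audio.
sample_rate = 22050  # assumed sample rate
t = np.arange(sample_rate) / sample_rate
dry = 0.5 * np.sin(2 * np.pi * 440.0 * t)

# Synthetic stand-in for a measured impulse response: decaying noise.
rng = np.random.default_rng(0)
ir_len = sample_rate // 10
toy_ir = rng.standard_normal(ir_len) * np.exp(-np.arange(ir_len) / (0.02 * sample_rate))
toy_ir[0] = 1.0  # direct sound

augmented = apply_impulse_response(dry, toy_ir)
```

In a training pipeline, an impulse response would typically be drawn at random from a collection of measured responses for each training example; the details of how the paper does this are not given in the abstract.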
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
