Audio-to-Score Conversion Model Based on Whisper methodology
Hongyao Zhang, Bohang Sun

TL;DR
This paper presents a Whisper-based Transformer model that converts music audio into ABC notation. It introduces a new notation system and a custom tokenizer, and reports improved accuracy over traditional algorithms.
Contribution
It introduces the Orpheus' Score notation system, a custom tokenizer, and a comprehensive data processing workflow for audio-to-score conversion.
Findings
Significantly improved accuracy compared to traditional algorithms
Effective data augmentation through mutation mechanisms
Provides a practical tool for music enthusiasts and researchers
Abstract
This thesis develops a Transformer model based on Whisper, which extracts melodies and chords from music audio and records them into ABC notation. A comprehensive data processing workflow is customized for ABC notation, including data cleansing, formatting, and conversion, and a mutation mechanism is implemented to increase the diversity and quality of training data. This thesis innovatively introduces the "Orpheus' Score", a custom notation system that converts music information into tokens, designs a custom vocabulary library, and trains a corresponding custom tokenizer. Experiments show that compared to traditional algorithms, the model has significantly improved accuracy and performance. While providing a convenient audio-to-score tool for music enthusiasts, this work also provides new ideas and tools for research in music information processing.
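To make the tokenization and mutation ideas in the abstract more concrete, here is a minimal Python sketch. It is not the Orpheus' Score tokenizer or the paper's mutation mechanism, neither of which is specified on this page. The token pattern, the function names `tokenize_abc` and `shift_octave_up`, and the choice of an octave shift as the mutation are all assumptions made for illustration, and only a small subset of ABC notation is covered.

```python
import re

# Hypothetical token pattern for a small subset of ABC notation.
# The actual Orpheus' Score vocabulary is not described in this summary.
TOKEN_RE = re.compile(
    r'"[^"]*"'                  # chord symbol in quotes, e.g. "Am"
    r"|[_=^]*[A-Ga-g][,']*"     # note: accidentals, pitch letter, octave marks
    r"|z"                       # rest
    r"|\d+/\d+|/\d+|\d+|/+"     # note length, e.g. 3/2, /2, 2, /
    r"|\|\]|\|\||\|:|:\||\|"    # bar lines
)

# A note token split into accidentals, pitch letter, and octave marks.
NOTE_RE = re.compile(r"([_=^]*)([A-Ga-g])([,']*)")


def tokenize_abc(body: str) -> list[str]:
    """Split an ABC tune body into tokens; whitespace is skipped."""
    return TOKEN_RE.findall(body)


def shift_octave_up(token: str) -> str:
    """Illustrative mutation (an assumption): raise a note token by one octave.

    In ABC, C, is below C, which is below c, which is below c'.
    Tokens that are not notes are returned unchanged.
    """
    match = NOTE_RE.fullmatch(token)
    if match is None:
        return token
    accidentals, letter, marks = match.groups()
    if "," in marks:
        marks = marks.replace(",", "", 1)   # C, -> C
    elif letter.isupper():
        letter = letter.lower()             # C -> c
    else:
        marks += "'"                        # c -> c'
    return accidentals + letter + marks


if __name__ == "__main__":
    tokens = tokenize_abc('"Am"A2 ^c/2 z | e\'4 |]')
    print(tokens)
    print([shift_octave_up(t) for t in tokens])
```

A real augmentation pipeline for audio-to-score training would also have to keep the audio consistent with the mutated score, for example by re-synthesizing or pitch-shifting the audio. That step is outside this sketch.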
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Multi-Head Attention · Adam · Dropout
