End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-Character Recognition Model

Daniel Stoller; Simon Durand; Sebastian Ewert

arXiv:1902.06797 · cs.SD · February 20, 2019 · 1 citation

TL;DR

This paper introduces an end-to-end audio-to-character recognition system for lyrics alignment in polyphonic music. It reaches a mean alignment error of 0.35s without separate sub-modules or fine-grained annotations.

Contribution

A model based on a modified Wave-U-Net architecture that predicts character probabilities directly from raw audio. It removes the need for separately optimized sub-modules and trains on weak, line-level annotations.

Findings

01 Achieves a mean alignment error of 0.35s on a standard dataset.

02 Outperforms state-of-the-art methods by an order of magnitude.

03 Operates without sub-modules like vocal separation or detection.
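
To make the headline metric concrete, the following minimal sketch computes a mean alignment error, assuming it is defined as the mean absolute difference between predicted and reference onset times. That definition, the function name `mean_alignment_error`, and the onset values are illustrative assumptions, not details taken from this page.

```python
def mean_alignment_error(predicted, reference):
    """Mean absolute difference in seconds between paired onset times.

    Assumed definition for illustration; the exact evaluation protocol
    behind the reported figure is not stated on this page.
    """
    if len(predicted) != len(reference):
        raise ValueError("onset lists must have equal length")
    if not predicted:
        raise ValueError("need at least one onset pair")
    total = sum(abs(p - r) for p, r in zip(predicted, reference))
    return total / len(predicted)


# Hypothetical word onsets in seconds (made-up values, not from the paper).
predicted_onsets = [0.50, 1.25, 2.00, 3.75]
reference_onsets = [0.25, 1.25, 2.50, 3.50]

error = mean_alignment_error(predicted_onsets, reference_onsets)
print(f"mean alignment error: {error:.2f}s")
```

With these made-up onsets the absolute deviations are 0.25, 0.00, 0.50, and 0.25 seconds, so the mean is 0.25s.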

Abstract

Time-aligned lyrics can enrich the music listening experience by enabling karaoke, text-based song retrieval and intra-song navigation, and other applications. Compared to text-to-speech alignment, lyrics alignment remains highly challenging, despite many attempts to combine numerous sub-modules including vocal separation and detection in an effort to break down the problem. Furthermore, training required fine-grained annotations to be available in some form. Here, we present a novel system based on a modified Wave-U-Net architecture, which predicts character probabilities directly from raw audio using learnt multi-scale representations of the various signal components. There are no sub-modules whose interdependencies need to be optimized. Our training procedure is designed to work with weak, line-level annotations available in the real world. With a mean alignment error of 0.35s on a…
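
The abstract describes a model that outputs character probabilities over time, but this page does not say how those probabilities become timestamps. One common way to do that step is a monotonic Viterbi alignment, sketched below under simplifying assumptions: every frame is assigned to exactly one lyric character, characters appear in order, and there is no blank or silence symbol. The function `forced_align`, the two-symbol alphabet, the probabilities, and the 0.5s hop are all hypothetical and are not the authors' implementation.

```python
import math


def forced_align(log_probs, target):
    """Return the first frame index assigned to each target symbol.

    log_probs[t][c] is the log-probability of symbol id c at frame t.
    target is the list of symbol ids in lyric order.
    Generic monotonic Viterbi alignment; an illustrative assumption only.
    """
    num_frames, num_symbols = len(log_probs), len(target)
    if num_symbols == 0 or num_frames < num_symbols:
        raise ValueError("need at least one frame per target symbol")

    neg_inf = -math.inf
    score = [[neg_inf] * num_symbols for _ in range(num_frames)]
    advanced = [[False] * num_symbols for _ in range(num_frames)]
    score[0][0] = log_probs[0][target[0]]

    for t in range(1, num_frames):
        # At frame t, at most t + 1 symbols can have started.
        for n in range(min(t + 1, num_symbols)):
            stay = score[t - 1][n]
            move = score[t - 1][n - 1] if n > 0 else neg_inf
            if move > stay:
                advanced[t][n] = True
                best = move
            else:
                best = stay
            score[t][n] = best + log_probs[t][target[n]]

    # Backtrack from the last frame and last symbol to recover onsets.
    onsets = [0] * num_symbols
    n = num_symbols - 1
    for t in range(num_frames - 1, 0, -1):
        if advanced[t][n]:
            onsets[n] = t
            n -= 1
    return onsets


# Hypothetical per-frame probabilities for symbols 0 ("a") and 1 ("b").
probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9], [0.2, 0.8]]
log_probs = [[math.log(p) for p in frame] for frame in probs]

onset_frames = forced_align(log_probs, target=[0, 1])
hop_seconds = 0.5  # assumed time between frames
onset_times = [frame * hop_seconds for frame in onset_frames]
print(onset_frames, onset_times)
```

In this toy input the best path assigns frames 0 and 1 to "a" and frames 2 to 4 to "b", so the onsets are frames 0 and 2, or 0.0s and 1.0s at the assumed hop. A practical system would also need to handle silence and instrumental passages, which this sketch omits.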

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing