Exploiting Music Source Separation for Automatic Lyrics Transcription with Whisper
Jaza Syed, Ivan Meresman Higgs, Ondřej Cífka, Mark Sandler

TL;DR
This paper investigates how music source separation can improve automatic lyrics transcription (ALT) with Whisper, demonstrating lower error rates and establishing best practices for short-form and long-form transcription without additional training.
Contribution
It systematically evaluates the impact of source separation on ALT performance with Whisper and introduces new algorithms for segmenting long-form lyrics, achieving state-of-the-art results.
Findings
Source separation reduces word error rate (WER) in ALT tasks.
The proposed segmentation algorithm improves long-form transcription accuracy.
The approach achieves state-of-the-art results on the Jam-ALT benchmark.
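For readers unfamiliar with the metric behind the first finding, word error rate is the word-level edit distance between the reference lyrics and the transcript, divided by the number of reference words. The following is a minimal sketch and not the paper's evaluation code: the function name `word_error_rate` and the lowercase-and-split normalization are assumptions made for illustration, and benchmarks typically apply their own text normalization, which this sketch does not reproduce.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is normalized only by lowercasing and splitting on
    whitespace; real benchmarks usually normalize more carefully.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "hello darkness my old friend"
    hypothesis = "hello darkness old fiend"
    # One deleted word and one substituted word out of five reference words.
    print(word_error_rate(reference, hypothesis))
```

A lower value is better, and the value can exceed 1.0 when the transcript contains many inserted words, which can happen when a model hallucinates text over instrumental passages.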
Abstract
Automatic lyrics transcription (ALT) remains a challenging task in the field of music information retrieval, despite great advances in automatic speech recognition (ASR) brought about by transformer-based architectures in recent years. One of the major challenges in ALT is the high amplitude of interfering audio signals relative to conventional ASR due to musical accompaniment. Recent advances in music source separation have enabled automatic extraction of high-quality separated vocals, which could potentially improve ALT performance. However, the effect of source separation has not been systematically investigated in order to establish best practices for its use. This work examines the impact of source separation on ALT using Whisper, a state-of-the-art open source ASR model. We evaluate Whisper's performance on original audio, separated vocals, and vocal stems across short-form and…
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies
