Exploiting Music Source Separation for Automatic Lyrics Transcription with Whisper

Jaza Syed; Ivan Meresman Higgs; Ondřej Cífka; Mark Sandler

arXiv:2506.15514·cs.SD·June 19, 2025

TL;DR

This paper investigates how music source separation can enhance automatic lyrics transcription (ALT) using Whisper, demonstrating improved accuracy and establishing best practices for short and long-form transcriptions without additional training.

Contribution

It systematically evaluates the impact of source separation on ALT performance with Whisper and introduces new algorithms for segmenting long-form lyrics, achieving state-of-the-art results.

Findings

01 Source separation reduces Word Error Rate (WER) in ALT tasks.

02 The proposed segmentation algorithm improves long-form transcription accuracy.

03 The approach achieves state-of-the-art results on the Jam-ALT benchmark.
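
For readers unfamiliar with the metric named in the findings, the following is a minimal sketch of how Word Error Rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the normalization used here (lowercasing and whitespace splitting) are assumptions for illustration only; the paper and the Jam-ALT benchmark may apply different text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Normalization (lowercase + whitespace split) is an illustrative
    assumption, not the paper's or the benchmark's exact procedure.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One reference word is missing from the hypothesis.
    print(word_error_rate("hello darkness my old friend",
                          "hello darkness my friend"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis contains many spurious words, which is one reason interference from the accompaniment can inflate the score.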

Abstract

Automatic lyrics transcription (ALT) remains a challenging task in the field of music information retrieval, despite great advances in automatic speech recognition (ASR) brought about by transformer-based architectures in recent years. One of the major challenges in ALT is the high amplitude of interfering audio signals relative to conventional ASR due to musical accompaniment. Recent advances in music source separation have enabled automatic extraction of high-quality separated vocals, which could potentially improve ALT performance. However, the effect of source separation has not been systematically investigated in order to establish best practices for its use. This work examines the impact of source separation on ALT using Whisper, a state-of-the-art open source ASR model. We evaluate Whisper's performance on original audio, separated vocals, and vocal stems across short-form and…

Code & Models

Repositories

jaza-syed/mss-alt
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies