Enhancing Lyrics Transcription on Music Mixtures with Consistency Loss
Jiawen Huang, Felipe Sousa, Emir Demirel, Emmanouil Benetos, Igor Gadelha

TL;DR
This paper improves automatic lyrics transcription from music mixtures by using a consistency loss to better align vocal and mixture representations, enhancing transcription accuracy without needing singing voice separation.
Contribution
It introduces a novel consistency loss for fine-tuning foundation ASR models on singing voice data, and shows that structured dual-domain training improves lyrics transcription on music mixtures.
Findings
Consistency loss improves transcription accuracy.
Structured dual-domain training yields consistent gains.
Naive dual-domain fine-tuning underperforms.
Abstract
Automatic Lyrics Transcription (ALT) aims to recognize lyrics from singing voices, similar to Automatic Speech Recognition (ASR) for spoken language, but faces added complexity due to domain-specific properties of the singing voice. While foundation ASR models show robustness in various speech tasks, their performance degrades on singing voice, especially in the presence of musical accompaniment. This work focuses on this performance gap and explores Low-Rank Adaptation (LoRA) for ALT, investigating both single-domain and dual-domain fine-tuning strategies. We propose using a consistency loss to better align vocal and mixture encoder representations, improving transcription on mixture without relying on singing voice separation. Our results show that while naïve dual-domain fine-tuning underperforms, structured training with consistency loss yields modest but consistent gains,…
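To make the idea of aligning vocal and mixture encoder representations concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the authors' implementation: the abstract does not state the distance function or the loss weighting, so mean squared error and the `weight` parameter are assumed here, and the names `mse_consistency` and `combined_loss` are hypothetical.

```python
def mse_consistency(vocal_repr, mixture_repr):
    """Mean squared difference between two frame-by-feature representations.

    Both arguments are lists of frames, each frame a list of floats.
    Assumption: MSE is used as the distance; the source does not specify it.
    """
    if len(vocal_repr) != len(mixture_repr):
        raise ValueError("representations must have the same number of frames")
    total = 0.0
    count = 0
    for v_frame, m_frame in zip(vocal_repr, mixture_repr):
        if len(v_frame) != len(m_frame):
            raise ValueError("frames must have the same feature dimension")
        for v, m in zip(v_frame, m_frame):
            total += (v - m) ** 2
            count += 1
    if count == 0:
        raise ValueError("representations must not be empty")
    return total / count


def combined_loss(asr_loss_vocal, asr_loss_mixture, vocal_repr, mixture_repr, weight=0.5):
    """Sum of both transcription losses plus a weighted consistency term.

    Assumption: a simple weighted sum; `weight` is a hypothetical hyperparameter.
    """
    consistency = mse_consistency(vocal_repr, mixture_repr)
    return asr_loss_vocal + asr_loss_mixture + weight * consistency
```

In this sketch the consistency term is zero when the encoder produces identical representations for the isolated vocal and for the full mixture, so minimizing it pushes the mixture representation toward the vocal one without running a separation model at inference time.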
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies
Methods: ALIGN
