TL;DR
This paper introduces VietLyrics, the first large-scale Vietnamese lyrics dataset, and demonstrates how fine-tuning Whisper models on this data improves automatic lyrics transcription for Vietnamese music.
Contribution
The paper creates the first large-scale Vietnamese ALT dataset and shows that Whisper models fine-tuned on it outperform existing multilingual ALT systems.
Findings
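Whisper models fine-tuned on VietLyrics outperform existing multilingual ALT systems, including LyricWhiz.
Current ASR-based approaches produce frequent transcription errors and hallucinate text in non-vocal segments.
The VietLyrics dataset (647 hours of songs with line-level aligned lyrics) enables ALT research for a low-resource language.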
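The findings above refer to transcription accuracy without naming a metric. As a minimal sketch, assuming a word-level measure applies, the snippet below computes word error rate (WER), a metric commonly used for speech and lyrics transcription. This summary does not state which metric the authors report, and the function name `word_error_rate` and the example strings are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper, not taken from the paper or its released code.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# A hallucinated tail adds insertions, so WER can exceed 1.0.
clean = word_error_rate("em ơi hà nội phố", "em ơi hà nội phố")
noisy = word_error_rate("la la", "la la thank you for watching")
```

Because insertions count as errors, text hallucinated over a non-vocal segment raises the score even when every sung word is transcribed correctly.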
Abstract
Automatic Lyrics Transcription (ALT) for Vietnamese music presents unique challenges due to its tonal complexity and dialectal variations, but remains largely unexplored due to the lack of a dedicated dataset. To address this gap, we curated the first large-scale Vietnamese ALT dataset (VietLyrics), comprising 647 hours of songs with line-level aligned lyrics and metadata. Our evaluation of current ASR-based approaches reveals significant limitations, including frequent transcription errors and hallucinations in non-vocal segments. To improve performance, we fine-tuned Whisper models on the VietLyrics dataset, achieving superior results compared to existing multilingual ALT systems, including LyricWhiz. We publicly release VietLyrics and our models, aiming to advance Vietnamese music computing research while demonstrating the potential of this approach for ALT in…
