Using Songs to Improve Kazakh Automatic Speech Recognition

Rustem Yeshpanov

arXiv:2603.00961·eess.AS·March 10, 2026

TL;DR

This study investigates songs as an unconventional data source for improving Kazakh automatic speech recognition and shows that even a modest amount of song data can improve performance in low-resource settings.

Contribution

The paper proposes using song data for low-resource Kazakh ASR and provides a curated dataset of 3,013 song segments (about 4.5 hours) for research.

Findings

1. Song-based fine-tuning improves WER over zero-shot baselines.
2. Mixing songs with other corpora (CVC, FLEURS) improves ASR performance further.
3. Even a small song dataset (about 4.5 hours) can help low-resource ASR.

Abstract

Developing automatic speech recognition (ASR) systems for low-resource languages is hindered by the scarcity of transcribed corpora. This proof-of-concept study explores songs as an unconventional yet promising data source for Kazakh ASR. We curate a dataset of 3,013 audio-text pairs (about 4.5 hours) from 195 songs by 36 artists, segmented at the lyric-line level. Using Whisper as the base recogniser, we fine-tune models under seven training scenarios involving Songs, Common Voice Corpus (CVC), and FLEURS, and evaluate them on three benchmarks: CVC, FLEURS, and Kazakh Speech Corpus 2 (KSC2). Results show that song-based fine-tuning improves performance over zero-shot baselines. For instance, Whisper Large-V3 Turbo trained on a mixture of Songs, CVC, and FLEURS achieves 27.6% normalised WER on CVC and 11.8% on FLEURS, while halving the error on KSC2 (39.3% vs. 81.2%) relative to the…
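
The abstract reports results as normalised WER. As a rough illustration of what that metric measures, the sketch below computes word error rate as word-level edit distance divided by reference length, after a simple text normalisation. The normalisation rules here (NFKC, lowercasing, punctuation removal) and the function names `normalise`, `word_error_rate`, and `relative_reduction` are assumptions for illustration only; the paper's own normaliser is not described in this excerpt and may differ.

```python
import re
import unicodedata


def normalise(text: str) -> list[str]:
    """Illustrative normaliser (an assumption, not the paper's):
    NFKC-normalise, lowercase, replace punctuation with spaces, split."""
    text = unicodedata.normalize("NFKC", text).lower()
    # \w is Unicode-aware for str patterns, so Kazakh Cyrillic letters survive.
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = normalise(reference), normalise(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j] (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline
```

Applied to the KSC2 figures quoted above, `relative_reduction(81.2, 39.3)` gives roughly 0.516, a relative error reduction of about 52%, which is consistent with the abstract's description of the error being halved.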

Code & Models

Datasets

yeshpanovrustem/kazakh_songs_asr
dataset · 32 downloads

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing