FLEURS-Kobani: Extending the FLEURS Dataset for Northern Kurdish

Daban Q. Jaff; Mohammad Mohammadamini

arXiv:2603.29892 · cs.CL · April 1, 2026

TL;DR

FLEURS-Kobani introduces a new Northern Kurdish speech dataset extending the FLEURS benchmark, enabling evaluation of ASR and speech translation for this under-resourced language.

Contribution

It extends FLEURS to Northern Kurdish and provides fine-tuned Whisper v3-large baselines for ASR and end-to-end speech-to-text translation.

Findings

01

Two-stage fine-tuning of Whisper v3-large (Common Voice, then FLEURS-Kobani) yields the best ASR result: 28.11 WER and 9.84 CER on the test set.

02

End-to-end speech-to-text translation from Northern Kurdish (KMR) to English reaches 8.68 BLEU on the test set.

03

The dataset contains 5,162 validated utterances (18 hours and 24 minutes in total) recorded by 31 native speakers.

Abstract

FLEURS offers n-way parallel speech for 100+ languages, but Northern Kurdish is not one of them, which limits benchmarking for automatic speech recognition and speech translation tasks in this language. We present FLEURS-Kobani, a Northern Kurdish (ISO 639-3 KMR) spoken extension of the FLEURS benchmark. The FLEURS-Kobani dataset consists of 5,162 validated utterances, totaling 18 hours and 24 minutes. The data were recorded by 31 native speakers. It extends benchmark coverage to an under-resourced Kurdish variety. As baselines, we fine-tuned Whisper v3-large for ASR and E2E S2TT. A two-stage fine-tuning strategy (Common Voice to FLEURS-Kobani) yields the best ASR performance (WER 28.11, CER 9.84 on test). For E2E S2TT (KMR to EN), Whisper achieves 8.68 BLEU on test; we additionally report pivot-derived targets and a cascaded S2TT setup. FLEURS-Kobani provides the first public Northern…
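
The abstract reports ASR quality as WER and CER. As a rough illustration of what those numbers measure, the sketch below computes both as edit distance divided by reference length, pooled over all utterances. The function names, the toy strings, and the pooling choice are assumptions for illustration; the authors' actual scoring toolkit and text normalization are not stated here and may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Error rate in percent; unit='word' gives WER, unit='char' gives CER.

    Assumption: edits and reference lengths are pooled over the whole set
    rather than averaged per utterance.
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(r, h)
        total_len += len(r)
    return 100.0 * total_edits / total_len


# Toy strings, not taken from the dataset.
wer = error_rate(["a b c d"], ["a x c"], unit="word")  # 1 substitution + 1 deletion over 4 words
cer = error_rate(["abcd"], ["abxd"], unit="char")      # 1 substitution over 4 characters
print(wer, cer)
```

Read this way, a WER of 28.11 corresponds to roughly 28 word-level edits per 100 reference words, and a CER of 9.84 to roughly 10 character-level edits per 100 reference characters.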
