English to Central Kurdish Speech Translation: Corpus Creation, Evaluation, and Orthographic Standardization
Mohammad Mohammadamini, Daban Q. Jaff, Josep Crego, Marie Tahon, Antoine Laurent

TL;DR
This paper introduces KUTED, an English-to-Central Kurdish speech translation dataset built from TED and TEDx talks, and proposes a text standardization method that improves translation quality and consistency.
Contribution
The paper presents a new speech-to-text translation dataset for Central Kurdish and a systematic orthographic standardization approach that enhances translation performance.
Findings
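Orthographic variation significantly degrades Kurdish translation quality and produces nonstandard outputs.
Systematic text standardization yields substantial performance gains and more consistent translations.
A fine-tuned Seamless model achieves 15.18 BLEU on a test set held out from TED talks and improves the Seamless baseline by 3.0 BLEU on FLEURS.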
Abstract
We present KUTED, a speech-to-text translation (S2TT) dataset for Central Kurdish, derived from TED and TEDx talks. The corpus comprises 91,000 sentence pairs, including 170 hours of English audio, 1.65 million English tokens, and 1.40 million Central Kurdish tokens. We evaluate KUTED on the S2TT task and find that orthographic variation significantly degrades Kurdish translation performance, producing nonstandard outputs. To address this, we propose a systematic text standardization approach that yields substantial performance gains and more consistent translations. On a test set held out from TED talks, a fine-tuned Seamless model achieves 15.18 BLEU, and we improve the Seamless baseline by 3.0 BLEU on the FLEURS benchmark. We also train a Transformer model from scratch and evaluate a cascaded system that combines Seamless (ASR) with NLLB (MT).
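The abstract does not spell out the standardization rules, so the following is only a minimal sketch of what character-level orthographic standardization for Arabic-script Central Kurdish can look like. The mapping table and the function name `standardize` are illustrative assumptions, not the authors' rule set; they cover a few well-known cases in which visually similar Unicode code points are used interchangeably.

```python
import unicodedata

# Illustrative mappings only (an assumption, not the paper's rules):
# each key is a code point that often appears in place of the value.
CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF          -> ARABIC LETTER KEHEH
    "\u0640": None,      # ARABIC TATWEEL (stretching mark) is dropped
}
_TABLE = str.maketrans(CHAR_MAP)


def standardize(text: str) -> str:
    """Map variant code points to one canonical form and tidy whitespace."""
    # Compose combining sequences first so equal strings compare equal.
    text = unicodedata.normalize("NFC", text)
    # HEH followed by ZERO WIDTH NON-JOINER is a common way of typing
    # the vowel written canonically as ARABIC LETTER AE (U+06D5).
    text = text.replace("\u0647\u200C", "\u06D5")
    # Apply the single-character substitutions.
    text = text.translate(_TABLE)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())
```

In a setup like this, the same function would be applied to both training targets and reference translations, so that BLEU is not penalized for differences that are purely a matter of encoding.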
