Linear Semantic Segmentation for Low-Resource Spoken Dialects
Kirill Chirkunov, Younes Samih, Abed Alhakim Freihat, Hanan Aldarmaki

TL;DR
This paper introduces a new benchmark and segmentation model tailored for low-resource spoken dialects of Arabic, addressing challenges like informal syntax and code-switching.
Contribution
The paper provides a multi-genre benchmark for dialectal Arabic and proposes a segmentation model that improves robustness and local coherence on low-resource spoken dialects.
Findings
Segmentation models trained on Modern Standard Arabic (MSA) news genres perform poorly on dialectal speech.
The proposed model outperforms strong baselines on dialectal non-news genres.
The benchmark and approach are applicable to other low-resource spoken languages.
Abstract
Semantic segmentation is a core component of discourse analysis, yet existing models are primarily developed and evaluated on high-resource written text, limiting their effectiveness on low-resource spoken varieties. In particular, dialectal Arabic exhibits informal syntax, code-switching, and weakly marked discourse structure that challenge standard segmentation approaches. In this paper, we introduce a new multi-genre benchmark (more than 1000 samples) for semantic segmentation in conversational Arabic, focusing on dialectal discourse. The benchmark covers transcribed casual telephone conversations, code-switched podcasts, broadcast news, and expressive dialogue from novels, and was annotated and validated by native Arabic annotators. Using this benchmark, we show that segmentation models performing well on MSA news genres degrade on dialectal transcribed speech. We further propose a…
