Linear Semantic Segmentation for Low-Resource Spoken Dialects

Kirill Chirkunov; Younes Samih; Abed Alhakim Freihat; Hanan Aldarmaki

arXiv:2605.06276·cs.CL·May 8, 2026

TL;DR

This paper introduces a new benchmark and segmentation model tailored for low-resource spoken dialects of Arabic, addressing challenges like informal syntax and code-switching.

Contribution

The paper provides a multi-genre benchmark for dialectal Arabic and proposes a segmentation model that improves robustness and local coherence on low-resource spoken dialects.

Findings

01. Segmentation models that perform well on Modern Standard Arabic (MSA) news genres degrade on transcribed dialectal speech.

02. The proposed model outperforms strong baselines on dialectal non-news genres.

03. The benchmark and approach are applicable to other low-resource spoken languages.

Abstract

Semantic segmentation is a core component of discourse analysis, yet existing models are primarily developed and evaluated on high-resource written text, limiting their effectiveness on low-resource spoken varieties. In particular, dialectal Arabic exhibits informal syntax, code-switching, and weakly marked discourse structure that challenge standard segmentation approaches. In this paper, we introduce a new multi-genre benchmark (more than 1000 samples) for semantic segmentation in conversational Arabic, focusing on dialectal discourse. The benchmark covers transcribed casual telephone conversations, code-switched podcasts, broadcast news, and expressive dialogue from novels, and was annotated and validated by native Arabic annotators. Using this benchmark, we show that segmentation models performing well on MSA news genres degrade on dialectal transcribed speech. We further propose a…
