CAFE A Novel Code switching Dataset for Algerian Dialect French and English
Houssam Eddine-Othman Lachemat, Akli Abbas, Nourredine Oukas, Yassine El Kheir, Samia Haboussi, Absar Chowdhury Shammur

TL;DR
This paper introduces CAFE, the first dataset capturing spontaneous code-switching among Algerian dialect, French, and English, and evaluates how ASR models perform on this multilingual data.
Contribution
The paper presents the first spontaneous code-switching dataset for Algerian dialect, French, and English, along with benchmarking of state-of-the-art ASR models on this challenging data.
Findings
ASR models face challenges with code-switching and dialectal variations.
Data processing and decoding techniques improve ASR performance.
CAFE dataset enables future research in multilingual speech recognition.
Abstract
The paper introduces and publicly releases (data download link available after acceptance) CAFE, the first code-switching dataset covering Algerian dialect, French, and English. The CAFE speech data is unique in that (a) it features a spontaneous speaking style in in vivo human-human conversation, capturing phenomena such as code-switching and overlapping speech; (b) it addresses distinct linguistic challenges of the North African Arabic dialect; and (c) it captures dialectal variations from various parts of Algeria within different sociolinguistic contexts. CAFE contains approximately 37 hours of speech, with a subset, CAFE-small, of 2 hours and 36 minutes released with manual human annotation, including speech segmentation, transcription, explicit annotation of code-switching points, overlapping speech, and other events such as noise and laughter. The rest approximately 34.58…
Taxonomy
Topics: Natural Language Processing Techniques · Linguistic Variation and Morphology · Language, Linguistics, Cultural Analysis
