Introducing Syllable Tokenization for Low-resource Languages: A Case Study with Swahili
Jesse Atuhurra, Hiroyuki Shindo, Hidetaka Kamigaito, Taro Watanabe

TL;DR
This paper introduces a syllable-based tokenization method for low-resource languages like Swahili, demonstrating its effectiveness in creating better language models compared to traditional subword tokenization methods.
Contribution
The paper proposes a novel syllable tokenizer and validates it through experiments with GPT2 on Swahili, extending existing subword tokenization approaches.
Findings
The syllable tokenizer produces meaningful syllable embeddings for Swahili.
Syllable-based models outperform traditional subword models in text generation tasks.
The approach enhances multilingual NLP for syllable-rich low-resource languages.
Abstract
Many attempts have been made in multilingual NLP to ensure that pre-trained language models (PLMs), such as mBERT or GPT2, improve and become applicable to low-resource languages. To achieve multilingualism for PLMs, we need techniques to create word embeddings that capture the linguistic characteristics of any language. Tokenization is one such technique because it allows words to be split based on characters or subwords, creating word embeddings that best represent the structure of the language. Creating such word embeddings is essential to applying PLMs to languages on which the model was not trained, enabling multilingual NLP. However, most PLMs use generic tokenization methods like BPE, WordPiece, or unigram, which may not suit specific languages. We hypothesize that tokenization based on syllables within the input text, which we call syllable…
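To make the idea of splitting text into syllables concrete, here is a minimal sketch in Python. It is not the authors' tokenizer: it assumes a simplified open-syllable rule (any run of consonants followed by a single vowel forms one syllable), which fits many Swahili words but ignores syllabic nasals (for example, the initial "m" in "mtu" stays attached to the next syllable) and the "ng'" digraph. The names syllabify and syllable_tokenize are hypothetical and chosen only for illustration.

```python
import re

# Assumption: a syllable is zero or more consonants followed by one vowel.
# This is a simplification and not the rule set used in the paper.
SYLLABLE = re.compile(r"[^aeiou]*[aeiou]")
WORD = re.compile(r"[a-z]+")


def syllabify(word):
    """Split one lowercase alphabetic word into approximate syllables."""
    pieces = SYLLABLE.findall(word)
    consumed = sum(len(p) for p in pieces)
    tail = word[consumed:]  # trailing consonants with no vowel after them
    if tail:
        if pieces:
            pieces[-1] += tail  # attach leftovers to the last syllable
        else:
            pieces = [tail]     # word without any vowel: keep it whole
    return pieces


def syllable_tokenize(text):
    """Lowercase the text, extract words, and return a flat syllable list."""
    tokens = []
    for word in WORD.findall(text.lower()):
        tokens.extend(syllabify(word))
    return tokens


if __name__ == "__main__":
    print(syllabify("kitabu"))
    print(syllable_tokenize("Ninapenda kusoma"))
```

In a language-model setting, the resulting syllables would serve as the vocabulary units in place of BPE or WordPiece subwords; how the paper builds its vocabulary and handles word boundaries is not stated in the excerpt above, so that step is left out of the sketch.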
Taxonomy
Topics: Multilingual Education and Policy · ICT in Developing Communities
Methods: Byte Pair Encoding · mBERT
