Subword Segmental Language Modelling for Nguni Languages
Francois Meyer, Jan Buys

TL;DR
This paper introduces a subword segmental language model (SSLM) that jointly learns subword segmentation and language modelling. On the four low-resource Nguni languages it outperforms BPE-based models on average, and the subwords it discovers are morpheme-like.
Contribution
The paper proposes SSLM, a unified model that learns subword segmentation during language model training, so that the learned subwords optimise LM performance. On low-resource agglutinative languages it outperforms BPE-based models on average and, used as an unsupervised segmenter, surpasses standard segmenters.
Findings
SSLM outperforms BPE-based models on average across the four Nguni languages
SSLM surpasses standard morphological segmenters in unsupervised segmentation
Word-level SSLM acts as an effective morphological segmenter
Abstract
Subwords have become the standard units of text in NLP, enabling efficient open-vocabulary models. With algorithms like byte-pair encoding (BPE), subword segmentation is viewed as a preprocessing step applied to the corpus before training. This can lead to sub-optimal segmentations for low-resource languages with complex morphologies. We propose a subword segmental language model (SSLM) that learns how to segment words while being trained for autoregressive language modelling. By unifying subword segmentation and language modelling, our model learns subwords that optimise LM performance. We train our model on the 4 Nguni languages of South Africa. These are low-resource agglutinative languages, so subword information is critical. As an LM, SSLM outperforms existing approaches such as BPE-based models on average across the 4 languages. Furthermore, it outperforms standard subword…
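The abstract does not spell out how segmentation is learned during training. A minimal sketch, assuming the dynamic-programming approach common to segmental language models: the likelihood of a word is the sum over all ways of splitting it into subwords, computed with a forward recursion, and the most probable split is read off with a Viterbi pass. The scorer `toy_seg_logp`, its invented scores, and the `max_len` cap are hypothetical illustrations, not details from the paper; a real model would score each subword with a neural network conditioned on the preceding text.

```python
import math
from typing import Callable, List

SegScorer = Callable[[str], float]  # maps a candidate subword to a log-probability


def logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def marginal_log_prob(word: str, seg_logp: SegScorer, max_len: int = 5) -> float:
    """Forward recursion: log of the summed probability of all segmentations."""
    n = len(word)
    alpha = [-math.inf] * (n + 1)  # alpha[t] = log-prob of generating word[:t]
    alpha[0] = 0.0
    for end in range(1, n + 1):
        terms = [
            alpha[start] + seg_logp(word[start:end])
            for start in range(max(0, end - max_len), end)
        ]
        alpha[end] = logsumexp(terms)
    return alpha[n]


def best_segmentation(word: str, seg_logp: SegScorer, max_len: int = 5) -> List[str]:
    """Viterbi pass: the single highest-scoring segmentation."""
    n = len(word)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)  # back[t] = start index of the last subword ending at t
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start] + seg_logp(word[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    segments = []
    end = n
    while end > 0:
        segments.append(word[back[end]:end])
        end = back[end]
    return segments[::-1]


# Hypothetical, context-independent scorer with invented scores (illustration only).
TOY_LEXICON = {
    "ngi": math.log(0.2),
    "ya": math.log(0.2),
    "bong": math.log(0.1),
    "a": math.log(0.2),
}


def toy_seg_logp(segment: str) -> float:
    # Unknown subwords are penalised heavily, in proportion to their length.
    return TOY_LEXICON.get(segment, -10.0 * len(segment))


print(best_segmentation("ngiyabonga", toy_seg_logp))  # ['ngi', 'ya', 'bong', 'a']
print(marginal_log_prob("ngiyabonga", toy_seg_logp))
```

Under this assumption, training would maximise the marginal rather than the score of one fixed split, which is what lets the segmentation adapt to the LM objective; the Viterbi split is only needed when a single segmentation must be reported.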
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
