From N-grams to Pre-trained Multilingual Models For Language Identification
Thapelo Sindane, Vukosi Marivate

TL;DR
This paper compares traditional N-gram models and modern pre-trained multilingual models for language identification across 11 South African languages, highlighting the effectiveness of Serengeti and proposing a lightweight BERT-based model.
Contribution
It provides a comprehensive evaluation of N-gram and pre-trained models for LID, introduces a new lightweight BERT-based LID model, and demonstrates Serengeti's superior performance.
Findings
Serengeti outperforms other models in LID accuracy.
Effective data size selection improves N-gram model performance.
The lightweight za_BERT_lid matches the best Afri-centric models.
Abstract
In this paper, we investigate the use of N-gram models and large pre-trained multilingual models for Language Identification (LID) across 11 South African languages. For N-gram models, this study shows that effective data size selection remains crucial for establishing frequency distributions that efficiently model each target language, thereby improving language ranking. For pre-trained multilingual models, we conduct extensive experiments covering a diverse set of massively multilingual pre-trained language models (PLMs): mBERT, RemBERT, and XLM-r, as well as the Afri-centric multilingual models AfriBERTa, Afro-XLMr, AfroLM, and Serengeti. We further compare these models with available large-scale Language Identification tools (Compact Language Detector v3 (CLD V3), AfroLID, GlotLID, and OpenLID) to highlight the importance of focus-based LID. From these, we show that Serengeti…
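To make the N-gram idea in the abstract concrete, here is a minimal sketch of frequency-based language ranking. It assumes character trigrams and add-one-smoothed log-likelihood scoring; the abstract does not state which N-gram method the paper uses, so this may differ from the authors' setup. The function names (`char_ngrams`, `train_profiles`, `rank_languages`) and the toy sentences are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_profiles(samples, n=3):
    """Build one n-gram frequency distribution per language from {lang: [sentences]}."""
    profiles = {}
    for lang, sentences in samples.items():
        counts = Counter()
        for sentence in sentences:
            counts.update(char_ngrams(sentence, n))
        profiles[lang] = counts
    return profiles


def rank_languages(text, profiles, n=3):
    """Rank languages by add-one-smoothed log-likelihood of the text's n-grams."""
    # Shared vocabulary size across all profiles, plus one slot for unseen n-grams.
    vocab_size = len(set().union(*profiles.values())) + 1
    grams = char_ngrams(text, n)
    scores = {}
    for lang, counts in profiles.items():
        total = sum(counts.values())
        scores[lang] = sum(
            math.log((counts[g] + 1) / (total + vocab_size)) for g in grams
        )
    return sorted(scores, key=scores.get, reverse=True)


# Toy training data (assumption: a real system would use far more text per language).
toy_samples = {
    "zul": ["ngiyabonga kakhulu", "sawubona unjani"],
    "afr": ["baie dankie", "goeie more hoe gaan dit"],
    "eng": ["thank you very much", "good morning how are you"],
}
profiles = train_profiles(toy_samples)
ranking = rank_languages("dankie", profiles)
```

In this sketch the amount of training text per language directly shapes each frequency distribution, which is the lever the abstract refers to as data size selection. The first element of `ranking` is the predicted language, and the remaining elements give the ordered alternatives.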
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training · mBERT
