From N-grams to Pre-trained Multilingual Models For Language Identification

Thapelo Sindane; Vukosi Marivate

arXiv:2410.08728·cs.CL·October 14, 2024

TL;DR

This paper compares traditional N-gram models and modern pre-trained multilingual models for language identification across 11 South African languages, highlighting the effectiveness of Serengeti and proposing a lightweight BERT-based model.

Contribution

It provides a comprehensive evaluation of N-gram and pre-trained models for LID, introduces a new lightweight BERT-based LID model, and demonstrates Serengeti's superior performance.

Findings

01 Serengeti outperforms other models in LID accuracy.

02 Effective data size selection improves N-gram model performance.

03 The lightweight za_BERT_lid matches the best Afri-centric models.

Abstract

In this paper, we investigate the use of N-gram models and large pre-trained multilingual models for Language Identification (LID) across 11 South African languages. For N-gram models, this study shows that effective data size selection remains crucial for establishing frequency distributions that efficiently model each target language, thus improving language ranking. For pre-trained multilingual models, we conduct extensive experiments covering a diverse set of massively pre-trained multilingual models (PLMs): mBERT, RemBERT, and XLM-r, together with the Afri-centric multilingual models AfriBERTa, Afro-XLMr, AfroLM, and Serengeti. We further compare these models with available large-scale Language Identification tools (Compact Language Detector v3 (CLD V3), AfroLID, GlotLID, and OpenLID) to highlight the importance of focused-based LID. From these, we show that Serengeti…
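The abstract does not spell out how the N-gram models rank languages. As a minimal sketch, assuming a classic rank-order ("out-of-place") comparison of character n-gram frequency profiles, the idea of building a frequency distribution per language and ranking candidates against it might look like the following. The function names (`ngram_profile`, `out_of_place`, `rank_languages`), the parameters `n_max` and `top_k`, and the two toy training sentences are all hypothetical and are not taken from the paper.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams (n = 1..n_max) to its rank."""
    # Normalise whitespace and pad so word boundaries appear in the n-grams.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top_k])}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

def rank_languages(text, profiles):
    """Return (language, distance) pairs, best match (smallest distance) first."""
    doc_profile = ngram_profile(text)
    scores = [(lang, out_of_place(doc_profile, prof)) for lang, prof in profiles.items()]
    return sorted(scores, key=lambda pair: pair[1])

# Toy training text (an assumption for illustration only, not the paper's data).
TOY_TRAIN = {
    "eng": "the cat sat on the mat and the dog slept by the door",
    "afr": "die kat sit op die mat en die hond slaap by die deur",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TOY_TRAIN.items()}

print(rank_languages("the dog sat by the door", PROFILES))
```

With profiles this small the ranking is fragile, which is consistent with the abstract's point that the amount of training data per language shapes how well the frequency distributions model it; `top_k` and the size of each training sample are the knobs to vary in such an experiment.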

Taxonomy

Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · mBERT