Improved Text Language Identification for the South African Languages

Bernardt Duvenhage; Mfundo Ntini; Phala Ramonyai

arXiv:1711.00247 · cs.CL · November 2, 2017

TL;DR

This paper combines a naive Bayes classifier with a lexicon-based classifier to identify South African languages in short texts, reducing language detection error by 31%.

Contribution

It introduces a two-stage hybrid classifier for South African languages: a naive Bayes classifier predicts the language family of a text, and a lexicon-based classifier then picks the specific language within that family.

Findings

01. 31% reduction in language detection error

02. Effective for short text messages

03. Open-source datasets and code provided

Abstract

Virtual assistants and text chatbots have recently been gaining popularity. Given the short message nature of text-based chat interactions, the language identification systems of these bots might only have 15 or 20 characters to make a prediction. However, accurate text language identification is important, especially in the early stages of many multilingual natural language processing pipelines. This paper investigates the use of a naive Bayes classifier to accurately predict the language family that a piece of text belongs to, combined with a lexicon-based classifier to distinguish the specific South African language that the text is written in. This approach leads to a 31% reduction in the language detection error. In the spirit of reproducible research, the training and testing datasets as well as the code are published on GitHub. Hopefully it will be useful to create a text…
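The two-stage idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the character-trigram features, the Laplace smoothing, the class and function names (`NaiveBayesFamily`, `lexicon_language`, `identify`), and the tiny toy training data and lexicons are all assumptions made for this example and are not taken from the paper or its repository.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Character n-grams of the lowercased text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesFamily:
    """Multinomial naive Bayes over character n-grams (assumed features)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # family -> n-gram counts
        self.docs = Counter()               # family -> number of samples
        self.vocab = set()

    def fit(self, samples):
        for text, family in samples:
            grams = char_ngrams(text, self.n)
            self.counts[family].update(grams)
            self.docs[family] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        best, best_score = None, -math.inf
        for family, grams in self.counts.items():
            # Log prior plus Laplace-smoothed log likelihood of each n-gram.
            score = math.log(self.docs[family] / total_docs)
            denom = sum(grams.values()) + len(self.vocab)
            for g in char_ngrams(text, self.n):
                score += math.log((grams[g] + 1) / denom)
            if score > best_score:
                best, best_score = family, score
        return best


def lexicon_language(text, lexicons):
    """Pick the language whose word list matches the most tokens.

    Ties resolve to the first language in the dict's insertion order.
    """
    tokens = text.lower().split()
    hits = {lang: sum(t in words for t in tokens)
            for lang, words in lexicons.items()}
    return max(hits, key=hits.get)


# Toy data invented for this sketch; far too small for real use.
FAMILY_SAMPLES = [
    ("thank you very much", "germanic"),
    ("good morning", "germanic"),
    ("baie dankie", "germanic"),
    ("goeie more", "germanic"),
    ("sawubona", "nguni"),
    ("ngiyabonga kakhulu", "nguni"),
    ("molo", "nguni"),
    ("enkosi kakhulu", "nguni"),
]
LEXICONS = {
    "germanic": {
        "eng": {"thank", "you", "good", "morning"},
        "afr": {"dankie", "baie", "goeie"},
    },
    "nguni": {
        "zul": {"sawubona", "ngiyabonga", "kakhulu"},
        "xho": {"molo", "enkosi", "kakhulu"},
    },
}

nb = NaiveBayesFamily().fit(FAMILY_SAMPLES)


def identify(text, nb, lexicons):
    """Stage 1: predict the family. Stage 2: pick a language within it."""
    family = nb.predict(text)
    return family, lexicon_language(text, lexicons[family])
```

Calling `identify("dankie", nb, LEXICONS)` first scores the text against each family and then consults only the word lists of the winning family. Splitting the decision this way keeps the second stage small: closely related languages share many character sequences, so a word-level lexicon is a plausible way to separate them once the family is known.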

Code & Models

Repositories

praekelt/feersum-lid-shared-task
none · Official
