Natural language processing for African languages
David Ifeoluwa Adelani

TL;DR
This dissertation investigates NLP challenges for African languages: it examines the quality of pre-training data, evaluates multilingual models, and creates new datasets for low-resource languages to improve NLP performance and representation.
Contribution
It introduces high-quality corpora, evaluates multilingual models, and develops annotated datasets for 21 African languages, advancing NLP research for low-resource languages.
Findings
Quality of pre-training data impacts semantic representations.
Multilingual PLMs benefit low-resource and unseen languages.
New datasets improve NLP tasks for African languages.
Abstract
Recent advances in word embeddings and language models use large-scale unlabelled data and self-supervised learning to boost NLP performance. Multilingual models, often trained on web-sourced data such as Wikipedia, face several challenges: few low-resource languages are included, the data for those languages is often noisy, and the lack of labelled datasets makes it hard to evaluate performance outside high-resource languages such as English. In this dissertation, we focus on languages spoken in Sub-Saharan Africa, where all indigenous languages can be regarded as low-resourced in terms of both the labelled data available for NLP tasks and the unlabelled data found on the web. We analyse the noise in publicly available corpora and curate a high-quality corpus, demonstrating that the quality of semantic representations learned in word embeddings depends not only on the amount of data but on the…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · ICT in Developing Communities
