Lexicon and Rule-based Word Lemmatization Approach for the Somali Language

Shafie Abdi Mohamed; Muhidin Abdullahi Mohamed

arXiv:2308.01785·cs.CL·August 4, 2023·1 cites

TL;DR

This paper introduces a lexicon and rule-based lemmatization method for Somali, a low-resource language, achieving high accuracy on short texts and laying groundwork for future NLP applications.

Contribution

It develops the first Somali lemmatizer using a lexicon and rules, addressing the language's low-resource status and enabling further NLP research.

Findings

01. 95.87% accuracy on social media messages
02. 60.57% accuracy on news extracts
03. 57% accuracy on full news articles

Abstract

Lemmatization is a Natural Language Processing (NLP) technique used to normalize text by reducing morphological variants of words to their root forms. It is a core pre-processing step in many NLP tasks, including text indexing, information retrieval, and machine learning for NLP. This paper pioneers the development of text lemmatization for Somali, a low-resource language with very limited or no prior effective adoption of NLP methods and datasets. Specifically, we develop a lexicon and rule-based lemmatizer for Somali text, which is a starting point for a full-fledged Somali lemmatization system for various NLP tasks. Taking the language's morphological rules into consideration, we have developed an initial lexicon of 1247 root words and 7173 derivationally related terms, enriched with rules for lemmatizing words not present in the lexicon. We have tested the…
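The two-stage idea described in the abstract (look a word up in the lexicon first, and fall back to rules only when it is missing) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `LEXICON`, `SUFFIX_RULES`, `MIN_STEM_LEN`, and `lemmatize` are hypothetical, the handful of lexicon entries are toy examples rather than entries from the paper's lexicon, and the two suffix rules (stripping the definite-article endings "-ka" and "-ta") are a heavily simplified assumption about what such rules might look like.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon mapping surface forms to root words.
# Illustrative only; these entries are not taken from the paper's lexicon.
LEXICON: Dict[str, str] = {
    "buug": "buug",
    "buugaag": "buug",
    "buugga": "buug",
    "qor": "qor",
    "qoray": "qor",
}

# Hypothetical fallback rules as (suffix, replacement) pairs.
# Assumption: a real rule set would be far larger and morphology-aware.
SUFFIX_RULES: List[Tuple[str, str]] = [
    ("ka", ""),
    ("ta", ""),
]

# Assumed guard so that very short words are not stripped down to nothing.
MIN_STEM_LEN = 3


def lemmatize(word: str) -> str:
    """Return a lemma via lexicon lookup, then suffix rules, else the word itself."""
    token = word.lower()

    # Stage 1: exact lexicon lookup.
    if token in LEXICON:
        return LEXICON[token]

    # Stage 2: first matching suffix rule that leaves a long-enough stem.
    for suffix, replacement in SUFFIX_RULES:
        stem_len = len(token) - len(suffix)
        if token.endswith(suffix) and stem_len >= MIN_STEM_LEN:
            return token[:stem_len] + replacement

    # Stage 3: no lexicon entry and no applicable rule; leave the word unchanged.
    return token


if __name__ == "__main__":
    for w in ["Buugaag", "qoray", "ninka", "naagta", "iyo"]:
        print(w, "->", lemmatize(w))
```

Checking the lexicon before the rules means that known irregular forms are never over-stripped by a generic suffix rule, while the rules give some coverage for words the lexicon has not seen.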

Code & Models

Repositories

shafieabdi/somalilemmatizer (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification