Shona spaCy: A Morphological Analyzer for an Under-Resourced Bantu Language

Happymore Masoka

arXiv:2511.16680·cs.CL·November 24, 2025

TL;DR

Shona spaCy is an open-source, rule-based morphological analyzer for Shona, an under-resourced Bantu language. Built on the spaCy framework, it combines a curated lexicon with linguistic rules to annotate each token with its lemma, part of speech, and morphological features.

Contribution

The paper introduces a linguistically grounded morphological pipeline for Shona. It encodes grammar rules for noun-class prefixes, verbal subject concords, tense-aspect markers, ideophones, and clitics, and exposes the results as token-level annotations in spaCy.

Findings

01. Achieved 90% POS-tagging accuracy on Shona corpora

02. Attained 88% accuracy in morphological feature recognition

03. Provides an accessible toolkit for Shona NLP applications

Abstract

Despite rapid advances in multilingual natural language processing (NLP), the Bantu language Shona remains under-served in terms of morphological analysis and language-aware tools. This paper presents Shona spaCy, an open-source, rule-based morphological pipeline for Shona built on the spaCy framework. The system combines a curated JSON lexicon with linguistically grounded rules to model noun-class prefixes (Mupanda 1-18), verbal subject concords, tense-aspect markers, ideophones, and clitics, integrating these into token-level annotations for lemma, part-of-speech, and morphological features. The toolkit is available via pip install shona-spacy, with source code at https://github.com/HappymoreMasoka/shona-spacy and a PyPI release at https://pypi.org/project/shona-spacy/0.1.4/. Evaluation on formal and informal Shona corpora yields 90% POS-tagging accuracy and 88% morphological-feature…
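
To make the lexicon-plus-rules idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the shona-spacy implementation or its API: the lexicon entries, the PREFIX_RULES table, the Analysis class, and the analyze function are all hypothetical names chosen for illustration, and the rule table covers only a handful of unambiguous noun-class prefixes. The sketch looks a token up in the lexicon first and falls back to prefix matching only when the lookup fails.

```python
from dataclasses import dataclass

# Hypothetical mini-lexicon. The actual JSON lexicon format used by the
# toolkit is not described in the text above, so this layout is an assumption.
LEXICON = {
    "munhu": {"lemma": "munhu", "pos": "NOUN", "morph": "NounClass=1"},
    "vanhu": {"lemma": "munhu", "pos": "NOUN", "morph": "NounClass=2"},
}

# Illustrative subset of noun-class prefix rules, listed longest first so
# that a longer prefix is tried before a shorter one.
PREFIX_RULES = [
    ("zvi", "NounClass=8"),
    ("chi", "NounClass=7"),
    ("va", "NounClass=2"),
    ("mi", "NounClass=4"),
]


@dataclass
class Analysis:
    text: str
    lemma: str
    pos: str
    morph: str


def analyze(token: str) -> Analysis:
    """Return a lemma, POS tag, and morphological feature string for a token."""
    key = token.lower()

    # 1. Lexicon lookup: curated entries take priority over rules.
    entry = LEXICON.get(key)
    if entry is not None:
        return Analysis(token, entry["lemma"], entry["pos"], entry["morph"])

    # 2. Rule fallback: guess a noun class from the prefix. This will
    #    over-generate on non-nouns; a fuller system would need more context.
    for prefix, morph in PREFIX_RULES:
        if key.startswith(prefix) and len(key) > len(prefix):
            return Analysis(token, key, "NOUN", morph)

    # 3. Unknown token: leave it unanalyzed.
    return Analysis(token, key, "X", "")


if __name__ == "__main__":
    for word in ["vanhu", "zvigaro", "Chigaro", "xyz"]:
        print(analyze(word))
```

Putting the lexicon ahead of the rules lets curated entries resolve cases that prefixes alone cannot, such as mapping the plural vanhu back to the singular lemma munhu. The rule fallback leaves the lemma as the lowercased surface form because the sketch has no singular/plural pairing table.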

Taxonomy

Topics: Natural Language Processing Techniques · ICT in Developing Communities · Language and cultural evolution