Shona spaCy: A Morphological Analyzer for an Under-Resourced Bantu Language
Happymore Masoka

TL;DR
Shona spaCy is an open-source, rule-based morphological analyzer for the under-resourced Shona language. It combines linguistic rules with a curated lexicon to annotate each token with its lemma, part of speech, and morphological features.
Contribution
It introduces a linguistically grounded morphological pipeline for Shona built on the spaCy framework, pairing a curated JSON lexicon with rules for noun-class prefixes, verbal subject concords, tense-aspect markers, ideophones, and clitics.
Findings
Achieves 90% POS-tagging accuracy on formal and informal Shona corpora
Attains 88% accuracy in morphological-feature recognition
Provides an accessible, pip-installable toolkit for Shona NLP applications
Abstract
Despite rapid advances in multilingual natural language processing (NLP), the Bantu language Shona remains under-served in terms of morphological analysis and language-aware tools. This paper presents Shona spaCy, an open-source, rule-based morphological pipeline for Shona built on the spaCy framework. The system combines a curated JSON lexicon with linguistically grounded rules to model noun-class prefixes (Mupanda 1-18), verbal subject concords, tense-aspect markers, ideophones, and clitics, integrating these into token-level annotations for lemma, part-of-speech, and morphological features. The toolkit is available via pip install shona-spacy, with source code at https://github.com/HappymoreMasoka/shona-spacy and a PyPI release at https://pypi.org/project/shona-spacy/0.1.4/. Evaluation on formal and informal Shona corpora yields 90% POS-tagging accuracy and 88% morphological-feature…
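To make the lexicon-plus-rules idea concrete, the following is a minimal sketch of noun-class prefix analysis in plain Python. It is not the shona-spacy implementation or its API: the names PREFIX_CLASSES, LEXICON, Analysis, and analyze_noun are hypothetical, the prefix table covers only a handful of classes, and the two-stem lexicon is a toy stand-in for the curated JSON lexicon described above. It assumes a prefix is only accepted when the remaining stem is listed in the lexicon for that class, which is one plausible way to resolve ambiguous prefixes such as mu- (class 1 or class 3).

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative subset of Shona noun-class prefixes (assumption: not the
# toolkit's actual rule table). "mu" is ambiguous between classes 1 and 3.
PREFIX_CLASSES = {
    "zvi": (8,),
    "chi": (7,),
    "mu": (1, 3),
    "va": (2,),
    "mi": (4,),
}

# Toy lexicon: noun stem -> noun classes the stem is attested with.
# munhu/vanhu (person/people), chinhu/zvinhu (thing/things), muti/miti (tree/trees).
LEXICON = {
    "nhu": {1, 2, 7, 8},
    "ti": {3, 4},
}


@dataclass(frozen=True)
class Analysis:
    token: str
    prefix: str
    stem: str
    noun_class: int
    pos: str = "NOUN"


def analyze_noun(token: str) -> Optional[Analysis]:
    """Return a prefix/stem/class analysis, or None if no rule applies."""
    word = token.lower()
    # Try longer prefixes first so a short prefix cannot shadow a longer one.
    for prefix in sorted(PREFIX_CLASSES, key=len, reverse=True):
        if not word.startswith(prefix):
            continue
        stem = word[len(prefix):]
        allowed = LEXICON.get(stem, set())
        # The lexicon decides which candidate class of an ambiguous prefix holds.
        for noun_class in PREFIX_CLASSES[prefix]:
            if noun_class in allowed:
                return Analysis(token, prefix, stem, noun_class)
    return None


if __name__ == "__main__":
    for word in ["munhu", "vanhu", "zvinhu", "muti", "imba"]:
        print(word, analyze_noun(word))
```

In a spaCy-based pipeline, an analysis like this would typically be written onto each token's lemma, part-of-speech, and morphology annotations; how shona-spacy does that internally is not specified here.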
Taxonomy
Topics: Natural Language Processing Techniques · ICT in Developing Communities · Language and cultural evolution
