Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering

Hillary Mutisya; John Mugane

arXiv:2604.22723·cs.LG·April 27, 2026


TL;DR

This paper introduces a novel method combining cross-lingual transfer learning and unsupervised clustering to discover morphological features in low-resource Bantu languages, validated on Giriama.

Contribution

The paper contributes a pipeline that ensembles cross-lingual transfer from Swahili with unsupervised clustering, combined via weighted voting, to assign noun classes and surface morphological patterns in a language with only 91 labeled paradigms.

Findings

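01. Discovered two previously undocumented morphological patterns in Giriama: an a- prefix variant for Class 2 (95.1% consistency) and a contracted k'- prefix (98.5% consistency).

02. Achieved 78.2% lemmatization accuracy in external validation on 444 known Giriama verb paradigms.

03. Reached 97.3% segmentation and 86.7% lemmatization rates on the v3 corpus expansion (19,624 words, 9,014 unique lemmas).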

Abstract

We present a method for discovering morphological features in low-resource Bantu languages by combining cross-lingual transfer learning with unsupervised clustering. Applied to Giriama (nyf), a language with only 91 labeled paradigms, our pipeline discovers noun class assignments for 2,455 words and identifies two previously undocumented morphological patterns: an a- prefix variant for Class 2 (the result of vowel coalescence of wa-, that is, the merger of two adjacent vowels; 95.1% consistency) and a contracted k'- prefix (98.5% consistency). External validation on 444 known Giriama verb paradigms confirms 78.2% lemmatization accuracy, while a v3 corpus expansion to 19,624 words (9,014 unique lemmas) achieves 97.3% segmentation and 86.7% lemmatization rates across all major word classes. Our ensemble of transfer learning from Swahili and unsupervised clustering, combined via weighted voting, exploits…
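The abstract describes combining a Swahili transfer model and an unsupervised clusterer via weighted voting. The following is a minimal sketch of what such a vote could look like, not the authors' implementation: the function name `weighted_vote`, the source names, the class labels, the per-label scores, and the weights are all hypothetical, since the listing does not state how votes are scored or weighted.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-source label scores into a single decision.

    predictions: {source_name: {label: score}}  (hypothetical structure)
    weights:     {source_name: weight}          (hypothetical values)
    Returns the winning label and the weighted total for every label.
    """
    totals = defaultdict(float)
    for source, label_scores in predictions.items():
        w = weights[source]
        for label, score in label_scores.items():
            totals[label] += w * score
    # Iterate labels in sorted order so that ties resolve deterministically.
    best = max(sorted(totals), key=lambda label: totals[label])
    return best, dict(totals)


# Illustrative inputs only: two sources disagree on a word's noun class.
example_predictions = {
    "swahili_transfer": {"class_1": 0.4, "class_2": 0.6},
    "clustering": {"class_1": 0.7, "class_2": 0.3},
}
example_weights = {"swahili_transfer": 0.6, "clustering": 0.4}

label, totals = weighted_vote(example_predictions, example_weights)
print(label, totals)
```

In this sketch a source with a larger weight can still be outvoted when the other source is much more confident, which is one reason a weighted sum of scores may be preferred over a simple majority between two predictors.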

Code & Models

thiomi/bantumorph-v7 (Hugging Face model, 72 downloads)
