Subword Tokenization Strategies for Kurdish Word Embeddings
Ali Salehi, Cassandra L. Jacobs

TL;DR
This paper compares tokenization strategies for Kurdish word embeddings and finds that morpheme-based methods outperform BPE under comprehensive, coverage-aware evaluation, even though BPE initially appears superior.
Contribution
It introduces a morphological segmenter for Kurdish and provides a thorough evaluation of tokenization methods, emphasizing coverage-aware assessment in low-resource languages.
Findings
Morpheme-based tokenization yields better semantic organization.
BPE shows inflated performance due to limited test coverage.
Coverage-aware evaluation is crucial for low-resource language processing.
Abstract
We investigate tokenization strategies for Kurdish word embeddings by comparing word-level, morpheme-based, and BPE approaches on morphological similarity preservation tasks. We develop a BiLSTM-CRF morphological segmenter using bootstrapped training from minimal manual annotation and evaluate Word2Vec embeddings across comprehensive metrics including similarity preservation, clustering quality, and semantic organization. Our analysis reveals critical evaluation biases in tokenization comparison. While BPE initially appears superior in morphological similarity, it evaluates only 28.6% of test cases compared to 68.7% for the morpheme-based model, creating artificial performance inflation. When assessed comprehensively, morpheme-based tokenization demonstrates superior embedding space organization, better semantic neighborhood structure, and more balanced coverage across morphological complexity…
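The coverage bias described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the toy vocabularies, and the choice to multiply the mean similarity by coverage (which counts uncovered pairs as zero) are all assumptions made for illustration. The point is only that a score averaged over the covered pairs should be reported together with the fraction of test pairs that were actually evaluated.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coverage_aware_score(pairs, embeddings):
    """Return (coverage, mean similarity on covered pairs, coverage-weighted mean).

    A pair is "covered" only when both words have an embedding. The weighted
    mean treats uncovered pairs as zero, which is an illustrative assumption
    and not a metric taken from the paper.
    """
    covered = [(a, b) for a, b in pairs if a in embeddings and b in embeddings]
    coverage = len(covered) / len(pairs) if pairs else 0.0
    if not covered:
        return coverage, 0.0, 0.0
    mean_sim = sum(cosine(embeddings[a], embeddings[b]) for a, b in covered) / len(covered)
    return coverage, mean_sim, mean_sim * coverage


# Hypothetical test pairs and toy 2-d embeddings (placeholders, not Kurdish data).
test_pairs = [("a", "b"), ("a", "c"), ("c", "d"), ("e", "f")]
narrow_model = {"a": [1.0, 0.0], "b": [1.0, 0.0]}  # covers 1 of 4 pairs
broad_model = {"a": [1.0, 0.0], "b": [1.0, 0.0],
               "c": [0.0, 1.0], "d": [0.0, 1.0]}    # covers 3 of 4 pairs

narrow = coverage_aware_score(test_pairs, narrow_model)
broad = coverage_aware_score(test_pairs, broad_model)
print("narrow:", narrow)
print("broad: ", broad)
```

In this toy setup the narrow model has the higher mean on the pairs it covers, but it evaluates only a quarter of the test set, so the ranking flips once coverage is taken into account.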
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling
