Subword Tokenization Strategies for Kurdish Word Embeddings

Ali Salehi; Cassandra L. Jacobs

arXiv:2511.14696 · cs.CL · November 19, 2025

TL;DR

This paper compares word-level, morpheme-based, and BPE tokenization for Kurdish word embeddings. BPE initially appears superior on morphological similarity, but once test coverage is taken into account, morpheme-based tokenization performs better.

Contribution

The paper introduces a BiLSTM-CRF morphological segmenter for Kurdish, bootstrapped from minimal manual annotation, and evaluates tokenization methods across several metrics, emphasizing coverage-aware assessment for low-resource languages.

Findings

01

Morpheme-based tokenization yields better semantic organization.

02

BPE's apparent advantage is inflated by limited test coverage: it evaluates only 28.6% of test cases, compared to 68.7% for the morpheme model.

03

Coverage-aware evaluation is crucial for low-resource language processing.

Abstract

We investigate tokenization strategies for Kurdish word embeddings by comparing word-level, morpheme-based, and BPE approaches on morphological similarity preservation tasks. We develop a BiLSTM-CRF morphological segmenter using bootstrapped training from minimal manual annotation and evaluate Word2Vec embeddings across comprehensive metrics including similarity preservation, clustering quality, and semantic organization. Our analysis reveals critical evaluation biases in tokenization comparison. While BPE initially appears superior in morphological similarity, it evaluates only 28.6% of test cases compared to 68.7% for the morpheme model, creating artificial performance inflation. When assessed comprehensively, morpheme-based tokenization demonstrates superior embedding space organization, better semantic neighborhood structure, and more balanced coverage across morphological complexity…
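
The coverage bias described in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy vectors, and the choice to score a test pair only when both words have a vector are assumptions made for illustration. The sketch reports coverage next to the mean similarity, and also rescores every model on the subset of pairs that all models cover, which is one possible way to make the comparison fair.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def evaluate(pairs, vectors):
    """Score only the pairs whose two words both have a vector; report coverage too."""
    covered = [(a, b) for a, b in pairs if a in vectors and b in vectors]
    sims = [cosine(vectors[a], vectors[b]) for a, b in covered]
    return {
        "coverage": len(covered) / len(pairs) if pairs else 0.0,
        "mean_similarity": sum(sims) / len(sims) if sims else float("nan"),
        "n_evaluated": len(covered),
    }


def evaluate_on_shared(pairs, models):
    """Rescore every model on the pairs that all models can evaluate."""
    shared = [
        (a, b) for a, b in pairs
        if all(a in vecs and b in vecs for vecs in models.values())
    ]
    return {name: evaluate(shared, vecs) for name, vecs in models.items()}, len(shared)


# Toy data (hypothetical): four test pairs and two embedding tables.
pairs = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]
narrow = {"a": [1.0, 0.0], "b": [1.0, 0.0]}  # covers one easy pair
broad = {
    "a": [1.0, 0.0], "b": [1.0, 0.0],
    "c": [1.0, 0.0], "d": [0.0, 1.0],        # a harder pair the narrow model skips
    "e": [1.0, 1.0], "f": [1.0, 1.0],
}

narrow_report = evaluate(pairs, narrow)
broad_report = evaluate(pairs, broad)
shared_reports, n_shared = evaluate_on_shared(pairs, {"narrow": narrow, "broad": broad})

print("narrow:", narrow_report)
print("broad: ", broad_report)
print("shared pairs:", n_shared, shared_reports)
```

In this toy setup the narrow model gets the higher raw mean similarity only because it is scored on a single easy pair; on the shared subset the two models tie. Reporting coverage beside every score makes that kind of inflation visible.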

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling