Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay
Duygu Altinok

TL;DR
This study systematically evaluates Turkish subword tokenization strategies, analyzing how vocabulary size and tokenizer training corpus size jointly affect model performance across semantic and other linguistic tasks, and introduces diagnostics that help explain why a tokenizer succeeds or fails.
Contribution
It provides the first comprehensive, controlled analysis of Turkish subword tokenization, linking intrinsic diagnostics to downstream task performance and offering open-source tools.
Findings
Character-level tokenization benefits morphology-sensitive tasks.
Vocabulary size impacts semantic and syntactic task performance.
Morphology-aware diagnostics reveal segmentation quality issues.
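As an illustration of what a morphology-aware diagnostic might look like, the sketch below scores a tokenizer's segmentation of a word against a gold morpheme segmentation using boundary precision, recall, and F1. This is a minimal sketch under stated assumptions, not the paper's own metric: the function names, the example segmentations, and the choice of boundary F1 are all hypothetical, and WordPiece continuation markers such as `##` are assumed to have been stripped beforehand.

```python
from typing import List, Set, Tuple


def boundary_positions(segments: List[str]) -> Set[int]:
    """Return the character offsets of the internal boundaries of a segmented word."""
    positions: Set[int] = set()
    offset = 0
    for segment in segments[:-1]:  # the end of the word is not an internal boundary
        offset += len(segment)
        positions.add(offset)
    return positions


def boundary_prf(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Boundary precision, recall, and F1 of a predicted segmentation against a gold one."""
    if "".join(predicted) != "".join(gold):
        raise ValueError("both segmentations must spell the same word")
    pred = boundary_positions(predicted)
    ref = boundary_positions(gold)
    hits = len(pred & ref)
    # Convention (an assumption): an empty boundary set scores 1.0 rather than dividing by zero.
    precision = hits / len(pred) if pred else 1.0
    recall = hits / len(ref) if ref else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: "evlerimizde" ("in our houses") = ev + ler + imiz + de.
gold = ["ev", "ler", "imiz", "de"]
subword = ["evler", "imiz", "de"]      # an assumed subword tokenizer output
characters = list("evlerimizde")       # character-level baseline

subword_scores = boundary_prf(subword, gold)
character_scores = boundary_prf(characters, gold)
```

In this toy case the subword segmentation places every boundary it proposes correctly but misses one morpheme boundary, while the character baseline recovers every morpheme boundary at the cost of many spurious ones. A diagnostic of this kind separates under-segmentation from over-segmentation, which a single tokens-per-word count cannot do.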
Abstract
Tokenization is a pivotal design choice for neural language modeling in morphologically rich languages (MRLs) such as Turkish, where productive agglutination challenges both vocabulary efficiency and morphological fidelity. Prior studies have explored tokenizer families and vocabulary sizes but typically (i) vary vocabulary without systematically controlling the tokenizer's training corpus, (ii) provide limited intrinsic diagnostics, and (iii) evaluate a narrow slice of downstream tasks. We present the first comprehensive, principled study of Turkish subword tokenization, a "subwords manifest," that jointly varies vocabulary size and tokenizer training corpus size (data and vocabulary coupling), compares multiple tokenizer families under matched parameter budgets (WordPiece, morphology-level, and character baselines), and evaluates across semantic (NLI, STS, sentiment analysis, NER),…
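The data and vocabulary coupling described in the abstract can be pictured as a grid in which every vocabulary size is paired with every tokenizer training corpus size, rather than sweeping vocabulary size over one fixed corpus. The sketch below only enumerates such a grid; the specific sizes, the `TokenizerConfig` name, and the restriction to a single tokenizer family are hypothetical placeholders and are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class TokenizerConfig:
    """One cell of the (vocabulary size x training corpus size) grid."""
    family: str
    vocab_size: int
    corpus_sentences: int


# Hypothetical sweep values, chosen only for illustration.
VOCAB_SIZES = [8_000, 16_000, 32_000]
CORPUS_SIZES = [100_000, 1_000_000]


def build_grid(family: str) -> List[TokenizerConfig]:
    """Pair every vocabulary size with every corpus size for one tokenizer family."""
    return [
        TokenizerConfig(family, vocab, corpus)
        for vocab, corpus in product(VOCAB_SIZES, CORPUS_SIZES)
    ]


grid = build_grid("wordpiece")
for config in grid:
    print(config)
```

Enumerating the full cross product is what allows the effect of vocabulary size to be separated from the effect of how much text the tokenizer was trained on.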
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
