Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay

Duygu Altinok

arXiv:2602.06942·cs.CL·February 9, 2026

Open Access · 2 Models

TL;DR

This study systematically evaluates Turkish subword tokenization strategies, analyzing how vocabulary size and tokenizer training corpus size affect model performance across a range of linguistic tasks, and introduces diagnostics that help explain why a tokenizer succeeds or fails.

Contribution

It provides the first comprehensive, controlled analysis of Turkish subword tokenization, linking intrinsic diagnostics to downstream task performance and offering open-source tools.

Findings

1. Character-level tokenization benefits morphology-sensitive tasks.

2. Vocabulary size impacts semantic and syntactic task performance.

3. Morphology-aware diagnostics reveal segmentation quality issues.
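
The third finding refers to morphology-aware diagnostics. As a minimal sketch of what such a diagnostic can look like, the Python below scores a tokenizer's segmentation of a word against a gold morpheme segmentation using boundary precision, recall, and F1. This is an illustrative assumption, not the paper's metric (this summary does not specify one); the function names `boundaries` and `boundary_scores` are hypothetical, and the example segmentations are hand-written rather than produced by a real tokenizer.

```python
def boundaries(pieces):
    """Return the set of internal cut positions (character offsets) of a segmentation."""
    cuts, offset = set(), 0
    for piece in pieces[:-1]:  # the end of the last piece is the word end, not a cut
        offset += len(piece)
        cuts.add(offset)
    return cuts


def boundary_scores(predicted, gold):
    """Boundary precision, recall, and F1 of `predicted` against `gold` (hypothetical helper)."""
    if "".join(predicted) != "".join(gold):
        raise ValueError("both segmentations must spell the same word")
    pred_cuts, gold_cuts = boundaries(predicted), boundaries(gold)
    hits = len(pred_cuts & gold_cuts)
    # Convention assumed here: an empty set of cuts counts as perfect on that side.
    precision = hits / len(pred_cuts) if pred_cuts else 1.0
    recall = hits / len(gold_cuts) if gold_cuts else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# "evlerimizden" ("from our houses") = ev + ler + imiz + den
gold = ["ev", "ler", "imiz", "den"]
subword_guess = ["evler", "imiz", "den"]   # hand-written example, misses the ev|ler cut
char_level = list("evlerimizden")          # character baseline cuts everywhere

for name, seg in [("subword", subword_guess), ("character", char_level)]:
    p, r, f = boundary_scores(seg, gold)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

A character-level segmentation always reaches full boundary recall but low precision, so a score like this is more informative when read alongside sequence length or downstream results than on its own.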

Abstract

Tokenization is a pivotal design choice for neural language modeling in morphologically rich languages (MRLs) such as Turkish, where productive agglutination challenges both vocabulary efficiency and morphological fidelity. Prior studies have explored tokenizer families and vocabulary sizes but typically (i) vary vocabulary without systematically controlling the tokenizer's training corpus, (ii) provide limited intrinsic diagnostics, and (iii) evaluate a narrow slice of downstream tasks. We present the first comprehensive, principled study of Turkish subword tokenization, a "subwords manifest" that jointly varies vocabulary size and tokenizer training corpus size (data and vocabulary coupling), compares multiple tokenizer families under matched parameter budgets (WordPiece, morphology-level, and character baselines), and evaluates across semantic (NLI, STS, sentiment analysis, NER),…
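
The "data and vocabulary coupling" described in the abstract amounts to sweeping two axes together rather than one at a time. The sketch below enumerates such a grid as a list of run configurations. The specific sizes, the run-name format, and the `build_grid` helper are assumptions made for illustration; the abstract as shown here does not list the values actually used.

```python
from itertools import product

# Hypothetical grid values, chosen only for illustration; this summary does
# not state the vocabulary or corpus sizes used in the paper.
VOCAB_SIZES = [8_000, 32_000, 128_000]
CORPUS_SIZES = [1_000_000, 10_000_000, 100_000_000]  # e.g. number of sentences


def build_grid(vocab_sizes, corpus_sizes):
    """Return one run config per (vocabulary size, corpus size) pair."""
    return [
        {"name": f"wordpiece-v{v}-c{c}", "vocab_size": v, "corpus_size": c}
        for v, c in product(vocab_sizes, corpus_sizes)
    ]


grid = build_grid(VOCAB_SIZES, CORPUS_SIZES)
for run in grid:
    print(run["name"])
```

Enumerating the full cross product, instead of varying vocabulary size at a single fixed corpus size, is what makes it possible to separate the effect of the vocabulary from the effect of the data the tokenizer was trained on.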

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling