TL;DR
This paper presents a linguistically informed hybrid tokenizer for Turkish that improves morphological segmentation and outperforms general-purpose tokenizers on multiple NLP benchmarks.
Contribution
The authors introduce a novel Turkish tokenizer combining morphological, phonological, and subword techniques, with a comprehensive vocabulary and evaluation metrics.
Findings
Tokenizer achieves 90.29% Turkish Token Percentage on TR-MMLU.
Outperforms baselines on Turkish STS Benchmark and MTEB-TR.
Yields strongest accuracy on TurBLiMP under a proxy.
Abstract
Tokenization shapes how language models perceive morphology and meaning in NLP, yet widely used frequency-driven subword tokenizers (e.g., Byte Pair Encoding and WordPiece) can fragment morphologically rich and agglutinative languages in ways that obscure morpheme boundaries. We introduce a linguistically informed hybrid tokenizer for Turkish that combines (i) dictionary-driven morphological segmentation (roots and affixes), (ii) phonological normalization that maps allomorphic variants to shared identifiers, and (iii) a controlled subword fallback for out-of-vocabulary coverage. Concretely, our released Turkish vocabulary contains 22,231 root tokens mapped to 20,000 canonical root identifiers (with leading spaces to mark word boundaries), 72 affix identifiers that cover 177 allomorphic surface forms, and 12,696 subword units; an orthographic case token preserves capitalization without…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
