Tokens with Meaning: A Hybrid Tokenization Approach for Turkish

M. Ali Bayram; Ali Arda Fincan; Ahmet Semih Gümüş; Sercan Karakaş; Banu Diri; Savaş Yıldırım; Demircan Çelik

arXiv:2508.14292·cs.CL·April 1, 2026

TL;DR

This paper presents a linguistically informed hybrid tokenizer for Turkish that improves morphological segmentation and outperforms general-purpose tokenizers on multiple NLP benchmarks.

Contribution

The authors introduce a novel Turkish tokenizer combining morphological, phonological, and subword techniques, with a comprehensive vocabulary and evaluation metrics.

Findings

01 The tokenizer achieves a Turkish Token Percentage of 90.29% on TR-MMLU.

02 It outperforms baselines on the Turkish STS Benchmark and MTEB-TR.

03 It yields the strongest accuracy on TurBLiMP under a proxy.

Abstract

Tokenization shapes how language models perceive morphology and meaning in NLP, yet widely used frequency-driven subword tokenizers (e.g., Byte Pair Encoding and WordPiece) can fragment morphologically rich and agglutinative languages in ways that obscure morpheme boundaries. We introduce a linguistically informed hybrid tokenizer for Turkish that combines (i) dictionary-driven morphological segmentation (roots and affixes), (ii) phonological normalization that maps allomorphic variants to shared identifiers, and (iii) a controlled subword fallback for out-of-vocabulary coverage. Concretely, our released Turkish vocabulary contains 22,231 root tokens mapped to 20,000 canonical root identifiers (with leading spaces to mark word boundaries), 72 affix identifiers that cover 177 allomorphic surface forms, and 12,696 subword units; an orthographic case token preserves capitalization without…
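The three components named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the lexicons, the identifier names (`R_KITAP`, `A_PL`, `<upper>`, the `S_` prefix), and the greedy longest-match strategy are all assumptions chosen for illustration, and the leading-space word-boundary marking mentioned in the abstract is omitted. It shows how root variants and affix allomorphs could map to shared identifiers, with a subword fallback for anything the dictionaries do not cover.

```python
# Toy lexicons. Every entry and identifier here is hypothetical.
ROOTS = {"kitap": "R_KITAP", "kitab": "R_KITAP", "ev": "R_EV"}
AFFIXES = {
    "lar": "A_PL", "ler": "A_PL",                                 # plural allomorphs
    "da": "A_LOC", "de": "A_LOC", "ta": "A_LOC", "te": "A_LOC",   # locative allomorphs
    "ı": "A_ACC", "i": "A_ACC", "u": "A_ACC", "ü": "A_ACC",       # accusative allomorphs
}
SUBWORDS = {"tok", "en"}
UPPER = "<upper>"  # assumed name for the orthographic case token


def longest_prefix(text, vocab):
    """Return the longest prefix of `text` found in `vocab`, or None."""
    for end in range(len(text), 0, -1):
        if text[:end] in vocab:
            return text[:end]
    return None


def subword_fallback(text):
    """Greedy longest-match over SUBWORDS, falling back to single characters."""
    pieces = []
    while text:
        piece = longest_prefix(text, SUBWORDS) or text[0]
        pieces.append("S_" + piece)
        text = text[len(piece):]
    return pieces


def tokenize_word(word):
    tokens = []
    if word[:1].isupper():
        tokens.append(UPPER)
        word = word.lower()  # not Turkish-aware (I/İ need special handling)
    root = longest_prefix(word, ROOTS)
    if root is None:
        return tokens + subword_fallback(word)
    tokens.append(ROOTS[root])
    rest = word[len(root):]
    while rest:
        affix = longest_prefix(rest, AFFIXES)
        if affix is None:
            # Remainder is not a known affix: hand it to the subword fallback.
            tokens.extend(subword_fallback(rest))
            break
        tokens.append(AFFIXES[affix])
        rest = rest[len(affix):]
    return tokens


print(tokenize_word("Kitaplarda"))  # ['<upper>', 'R_KITAP', 'A_PL', 'A_LOC']
print(tokenize_word("evlerde"))     # ['R_EV', 'A_PL', 'A_LOC']
print(tokenize_word("token"))       # ['S_tok', 'S_en']
```

In this sketch, "kitaplarda" and "evlerde" receive the same plural and locative identifiers even though their surface suffixes differ (-lar/-ler, -da/-de), which is the effect the abstract attributes to phonological normalization. A real system would need a principled way to resolve ambiguous segmentations, which greedy matching does not provide.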

Code & Models

Models

ikaganacar/ismail
