HeceTokenizer: A Syllable-Based Tokenization Approach for Turkish Retrieval

Senol Gulgonul

arXiv:2604.10665·cs.CL·April 14, 2026

TL;DR

HeceTokenizer is a syllable-based Turkish tokenizer that exploits the language's regular phonological patterns, enabling effective retrieval with a small model that outperforms a much larger morphology-driven baseline.

Contribution

The paper presents a novel syllable-based tokenization method for Turkish that exploits phonological regularities to improve retrieval performance with a lightweight model.

Findings

01

Achieves 50.3% Recall@5 on the TQuAD retrieval benchmark.

02

Surpasses the 46.92% Recall@5 reported by a morphology-driven baseline whose model is 200 times larger.

03

Builds a closed, OOV-free vocabulary of approximately 8,000 unique syllable types.
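
For readers unfamiliar with the metric, Recall@5 is the fraction of questions for which a relevant passage appears among the top five retrieved results. The sketch below is illustrative only: it assumes one gold passage per question, and the function name `recall_at_k` and the passage ids are hypothetical. The paper's own evaluation code and chunk-level matching rules are not shown on this page.

```python
from typing import Sequence


def recall_at_k(ranked_ids: Sequence[Sequence[str]],
                gold_ids: Sequence[str],
                k: int = 5) -> float:
    """Fraction of queries whose gold passage id is in the top-k results.

    Assumes exactly one gold passage per query (an assumption made for
    this sketch, not a detail taken from the paper).
    """
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("ranked_ids and gold_ids must have the same length")
    if not gold_ids:
        raise ValueError("at least one query is required")
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids)
               if gold in ranked[:k])
    return hits / len(gold_ids)


# Toy data: three queries, each with a ranked list of hypothetical ids.
ranked = [["p3", "p1", "p9"], ["p2", "p4"], ["p7"]]
gold = ["p1", "p5", "p7"]
score = recall_at_k(ranked, gold, k=5)
```

On this toy data two of the three gold passages are retrieved, so `score` is 2/3.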

Abstract

HeceTokenizer is a syllable-based tokenizer for Turkish that exploits the deterministic six-pattern phonological structure of the language to construct a closed, out-of-vocabulary (OOV)-free vocabulary of approximately 8,000 unique syllable types. A BERT-tiny encoder (1.5M parameters) is trained from scratch on a subset of Turkish Wikipedia with a masked language modeling objective and evaluated on the TQuAD retrieval benchmark using Recall@5. Combined with a fine-grained chunk-based retrieval strategy, HeceTokenizer achieves 50.3% Recall@5, surpassing the 46.92% reported by a morphology-driven baseline that uses a model 200 times larger. These results suggest that the phonological regularity of Turkish syllables provides a strong and resource-light inductive bias for retrieval tasks.
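
To make the syllable idea concrete, here is a minimal sketch of rule-based Turkish syllabification. It is not the paper's implementation, which is not shown on this page. It assumes the commonly taught rule that every syllable contains exactly one vowel and that a single consonant directly before a vowel starts that vowel's syllable, which yields the six shapes V, VC, CV, CVC, VCC, and CVCC for native words. The names `syllabify`, `cv_pattern`, and `turkish_lower` are hypothetical, and loanwords with initial consonant clusters (for example "tren") fall outside the six shapes.

```python
VOWELS = set("aeıioöuüâîû")


def turkish_lower(word: str) -> str:
    # Python's default lowercasing maps "I" to "i" and "İ" to a two-character
    # sequence, so handle the dotted/dotless i pair explicitly first.
    return word.replace("İ", "i").replace("I", "ı").lower()


def syllabify(word: str) -> list[str]:
    """Split a Turkish word into syllables with the one-vowel-per-syllable rule."""
    w = turkish_lower(word)
    vowel_idx = [i for i, ch in enumerate(w) if ch in VOWELS]
    if not vowel_idx:
        return [w] if w else []
    starts = [0]
    for prev, cur in zip(vowel_idx, vowel_idx[1:]):
        # If a consonant sits directly before this vowel, it opens the new
        # syllable; otherwise (two adjacent vowels) the vowel itself does.
        starts.append(cur - 1 if cur - 1 > prev else cur)
    ends = starts[1:] + [len(w)]
    return [w[a:b] for a, b in zip(starts, ends)]


def cv_pattern(syllable: str) -> str:
    """Map a syllable to its consonant/vowel shape, e.g. 'türk' -> 'CVCC'."""
    return "".join("V" if ch in VOWELS else "C" for ch in syllable)


examples = {w: syllabify(w) for w in ["kitap", "Türkçe", "İstanbul", "saat"]}
```

Under these assumptions, "kitap" splits into "ki" + "tap" and "İstanbul" into "is" + "tan" + "bul". Because the set of possible syllables is bounded by the six shapes, a vocabulary built from such units can stay closed, which is the property the abstract describes as OOV-free.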
