Do\u{g}al Dil \.I\c{s}lemede Tokenizasyon Standartlar{\i} ve \"Ol\c{c}\"um\"u: T\"urk\c{c}e \"Uzerinden B\"uy\"uk Dil Modellerinin Kar\c{s}{\i}la\c{s}t{\i}rmal{\i} Analizi
M. Ali Bayram, Ali Arda Fincan, Ahmet Semih G\"um\"u\c{s}, Sercan Karaka\c{s}, Banu Diri, Sava\c{s} Y{\i}ld{\i}r{\i}m

TL;DR
This paper presents a new evaluation framework for tokenization in morphologically-rich languages like Turkish, emphasizing language-specific metrics and demonstrating their impact on large language model performance.
Contribution
It introduces a novel set of metrics and a framework for assessing tokenization quality tailored to morphologically complex, low-resource languages.
Findings
Language-specific token percentages correlate strongly with downstream task performance.
Increasing model size alone does not improve linguistic understanding.
Proposed standards improve tokenization effectiveness for Turkish.
Abstract
Tokenization is a fundamental preprocessing step in Natural Language Processing (NLP), significantly impacting the capability of large language models (LLMs) to capture linguistic and semantic nuances. This study introduces a novel evaluation framework addressing tokenization challenges specific to morphologically-rich and low-resource languages such as Turkish. Utilizing the Turkish MMLU (TR-MMLU) dataset, comprising 6,200 multiple-choice questions from the Turkish education system, we assessed tokenizers based on vocabulary size, token count, processing time, language-specific token percentages (\%TR), and token purity (\%Pure). These newly proposed metrics measure how effectively tokenizers preserve linguistic structures. Our analysis reveals that language-specific token percentages exhibit a stronger correlation with downstream performance (e.g., MMLU scores) than token purity.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
