Syllabic Agglutinative Tokenizations for Indonesian LLM: A Study from Gasing Literacy Learning System
H. Situngkir, A.B. Lumbantobing, Y. Surya

TL;DR
This paper introduces a syllable-based tokenization method for Indonesian LLMs. By segmenting text into syllabic units and applying information-theoretic principles, it improves tokenization efficiency and linguistic alignment over conventional tokenizers.
Contribution
It presents a novel syllable-based tokenization framework for Indonesian that combines rule-based syllable segmentation with byte-pair encoding, is tailored to the language's morphophonological structure, and demonstrates significant empirical improvements.
Findings
Achieves higher Rényi efficiency (0.74) than multilingual tokenizers.
Maintains a higher average token length (3.67 characters) with a smaller vocabulary.
Reduces computational burden by internalizing character dependencies.
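The two metrics reported above can be computed from a tokenized corpus roughly as follows. This is a minimal sketch, assuming the commonly used definition of Rényi efficiency (Rényi entropy of the unigram token distribution divided by the log of the vocabulary size). The order `alpha=2.5` and the function names are assumptions for illustration; this summary does not state which order the authors used.

```python
import math
from collections import Counter


def renyi_efficiency(tokens, vocab_size, alpha=2.5):
    """Renyi entropy of the unigram token distribution, normalized by log(vocab_size).

    alpha=2.5 is an assumed default, not a value taken from the paper.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if alpha == 1.0:
        # The limit alpha -> 1 is the Shannon entropy.
        entropy = -sum(p * math.log(p) for p in probs)
    else:
        entropy = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return entropy / math.log(vocab_size)


def average_token_length(tokens):
    """Mean number of characters per token."""
    return sum(len(t) for t in tokens) / len(tokens)


if __name__ == "__main__":
    sample = ["mem", "ba", "ca", "bu", "ku"]
    print(renyi_efficiency(sample, vocab_size=5))
    print(average_token_length(sample))
```

A perfectly uniform token distribution over the whole vocabulary gives an efficiency of 1; a distribution dominated by a few very frequent tokens pushes the value toward 0.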
Abstract
This paper presents a novel syllable-based tokenization approach for Indonesian large language models, inspired by the Gasing Literacy Learning System's pedagogical methodology. Drawing on information-theoretic principles, we develop a tokenization framework that segments Indonesian text at syllable boundaries before applying byte-pair encoding, creating a vocabulary that aligns with the language's morphophonological structure. Our approach first identifies high-frequency syllables through rule-based segmentation, then constructs a compact vocabulary of 3,500 tokens that preserves meaningful linguistic units while maintaining coverage through character-level fallback. Empirical evaluation on Indonesian Wikipedia and folklore corpora from the Indonesian Culture Digital Library (PDBI) demonstrates substantial improvements over conventional tokenization methods: the syllable-based approach…
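The pipeline described in the abstract (rule-based syllable segmentation, a vocabulary of frequent syllables, character-level fallback) can be illustrated with a small sketch. The syllabification rules here are a deliberate simplification and an assumption, not the authors' rule set: the digraphs `ng`, `ny`, `kh`, `sy` count as single consonants, a lone consonant between vowels opens the next syllable, the first consonant of a cluster closes the previous one, and diphthongs are ignored. The byte-pair encoding stage is omitted, and all function names are hypothetical.

```python
from collections import Counter

VOWELS = set("aeiou")
DIGRAPHS = ("ng", "ny", "kh", "sy")  # treated as single consonant units (assumption)


def split_units(word):
    """Split a word into letter units, keeping digraphs together."""
    units, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            units.append(word[i:i + 2])
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def syllabify(word):
    """Simplified rule-based syllabification: V-CV and VC-CV splits only."""
    units = split_units(word.lower())
    nuclei = [i for i, u in enumerate(units) if u in VOWELS]
    if not nuclei:
        return ["".join(units)] if units else []
    boundaries = [0]
    for prev, nxt in zip(nuclei, nuclei[1:]):
        gap = nxt - prev - 1  # consonant units between two vowels
        # 0 or 1 consonant: split right after the vowel; 2+: first one is a coda.
        boundaries.append(prev + 1 if gap <= 1 else prev + 2)
    boundaries.append(len(units))
    return ["".join(units[a:b]) for a, b in zip(boundaries, boundaries[1:])]


def build_vocab(corpus_words, max_syllables):
    """Keep the most frequent syllables observed in the corpus."""
    counts = Counter(s for w in corpus_words for s in syllabify(w))
    return {s for s, _ in counts.most_common(max_syllables)}


def tokenize(word, vocab):
    """Emit known syllables whole; fall back to characters otherwise."""
    tokens = []
    for syl in syllabify(word):
        if syl in vocab:
            tokens.append(syl)
        else:
            tokens.extend(syl)  # character-level fallback
    return tokens


if __name__ == "__main__":
    words = ["membaca", "buku", "baca", "bunga"]
    vocab = build_vocab(words, max_syllables=3)
    print(syllabify("membaca"))
    print(tokenize("membaca", vocab))
```

With this toy corpus, the rare syllable `mem` is not in the three-entry vocabulary and is emitted as individual characters, while frequent syllables stay intact. In the approach the abstract describes, byte-pair encoding would then be applied on top of the syllable segmentation.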
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
