Tokenizations for Austronesian Language Models: study on languages in Indonesia Archipelago
Andhika Bernard Lumbantobing, Hokky Situngkir

TL;DR
This paper introduces a syllable-based tokenization method inspired by traditional Indonesian scripts, improving linguistic alignment and consistency across Austronesian languages for large language models.
Contribution
It develops a novel syllable-based tokenization framework tailored for Austronesian languages, addressing limitations of subword methods optimized on English corpora.
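To make the idea of syllable-level tokens concrete, the following is a minimal sketch of a rule-based syllabifier for Indonesian-like words. It is not the authors' procedure: the function name `syllabify`, the digraph list, and the splitting rule (the last consonant between two vowels opens the next syllable, any earlier ones close the previous syllable) are all simplifying assumptions chosen for illustration.

```python
import re

# Assumption: a small set of Indonesian digraphs is treated as single consonants.
UNIT = re.compile(r"ng|ny|kh|sy|[a-z]")
VOWELS = set("aeiou")


def syllabify(word):
    """Split a word into syllable-like tokens (hypothetical, simplified rule)."""
    units = UNIT.findall(word.lower())
    nuclei = [i for i, u in enumerate(units) if u in VOWELS]
    if not nuclei:
        # No vowel found: keep the word as a single token (or nothing if empty).
        return ["".join(units)] if units else []

    syllables = []
    start = 0
    for k, v in enumerate(nuclei):
        if k + 1 < len(nuclei):
            nxt = nuclei[k + 1]
            gap = nxt - v - 1  # consonant units between this vowel and the next
            # The last consonant in the gap becomes the next onset;
            # any earlier consonants stay as the coda of this syllable.
            end = nxt - 1 if gap >= 1 else nxt
        else:
            end = len(units)  # trailing consonants close the final syllable
        syllables.append("".join(units[start:end]))
        start = end
    return syllables


def tokenize(text):
    """Whitespace-split a text, then syllabify each word."""
    return [s for w in text.split() for s in syllabify(w)]


print(tokenize("makan bunga tempat"))
```

A real implementation would also need the paper's 2,843-token vocabulary lookup and handling for diphthongs, loanwords, and punctuation, none of which this sketch attempts.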
Findings
Syllable-based tokenization yields consistent Token per Character (TPC) ratios across languages.
It increases token sequence similarity by approximately 21% relative to GPT-2 tokenization.
The approach better preserves phonological and morphological patterns.
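As a rough illustration of the TPC metric named in the findings, the sketch below divides the number of tokens by the number of characters in the source text. The function name `tokens_per_character` is hypothetical, and excluding whitespace from the character count is an assumption; the summary does not state how the paper counts characters.

```python
def tokens_per_character(tokens, text):
    """Token per Character ratio: token count divided by character count.

    Assumption: whitespace is not counted as a character.
    """
    n_chars = len("".join(text.split()))
    if n_chars == 0:
        raise ValueError("text contains no non-whitespace characters")
    return len(tokens) / n_chars


# Hand-written syllable tokens for illustration only.
example_text = "makan nasi"
example_tokens = ["ma", "kan", "na", "si"]
print(tokens_per_character(example_tokens, example_text))
```

Under this definition, a tokenizer that splits parallel sentences into similarly sized units would give similar ratios from one language to the next.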
Abstract
Tokenization constitutes a fundamental stage in Large Language Model (LLM) processing; however, subword-based tokenization methods optimized on English-dominant corpora may produce token fragmentation misaligned with the linguistic structures of Austronesian languages. This study aimed to develop a syllable-based tokenization framework adopting principles from traditional Indonesian scripts (aksara) for regional languages of Indonesia. A syllabic segmentation procedure was constructed based on the logic of abugida writing systems and implemented with a vocabulary of 2,843 tokens extracted from the Indonesian dictionary (KBBI). Evaluation was conducted on the NusaX dataset comprising 1,000 parallel translation samples across 10 regional languages, Indonesian, and English. Analysis employed Token per Character (TPC) ratio and sequence alignment using the Smith-Waterman algorithm. Results…
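The sequence alignment step mentioned in the abstract can be sketched as a Smith-Waterman local alignment over token sequences rather than characters. The scoring values (match 2, mismatch -1, gap -1) and the normalization in `normalized_similarity` are assumptions made for illustration; the abstract does not give the paper's actual parameters.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between two token sequences.

    Scoring parameters are illustrative assumptions with a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # first row and column stay at 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores are floored at 0 so alignments can restart.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


def normalized_similarity(a, b, match=2, mismatch=-1, gap=-1):
    """Hypothetical normalization: score divided by the best possible score
    for the shorter sequence, giving a value in [0, 1]."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return smith_waterman_score(a, b, match, mismatch, gap) / (match * shortest)


x = ["ma", "kan", "na", "si"]
y = ["ma", "kan", "ro", "ti"]
print(smith_waterman_score(x, y), normalized_similarity(x, y))
```

Here the two sequences share the local run `ma`, `kan`, so the best score is two matches; tokenizers that cut cognate words at the same boundaries in different languages would produce longer shared runs and higher scores.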
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods
