Tokenizations for Austronesian Language Models: study on languages in Indonesia Archipelago

Andhika Bernard Lumbantobing; Hokky Situngkir

arXiv:2602.06998·cs.CY·February 10, 2026



TL;DR

This paper introduces a syllable-based tokenization method inspired by traditional Indonesian scripts, improving linguistic alignment and consistency across Austronesian languages for large language models.

Contribution

It develops a novel syllable-based tokenization framework tailored for Austronesian languages, addressing limitations of subword methods optimized on English corpora.
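To make the idea of syllable-level tokens concrete, here is a minimal sketch of a rule-based syllabifier for Indonesian-like spelling. It is not the authors' procedure: the function name `syllabify`, the digraph list, and the splitting rules are all illustrative assumptions. It treats every vowel letter as its own nucleus, so diphthongs are not handled.

```python
import re

# Assumption: five vowel letters, and four digraphs treated as one consonant each.
VOWELS = set("aeiou")
UNIT = re.compile(r"ng|ny|kh|sy|[a-z]")


def syllabify(word):
    """Split a word into rough syllables (hypothetical helper).

    Characters outside a-z are silently dropped by the unit regex.
    """
    units = UNIT.findall(word.lower())
    nuclei = [i for i, u in enumerate(units) if u in VOWELS]
    if not nuclei:
        # No vowel at all: return the whole word as one chunk.
        return ["".join(units)] if units else []

    boundaries = []  # start index of every syllable after the first
    for prev, nxt in zip(nuclei, nuclei[1:]):
        gap = nxt - prev - 1  # consonant units between two vowels
        # 0 or 1 consonant: the break falls right after the vowel.
        # 2 or more: the first consonant closes the previous syllable.
        boundaries.append(prev + 1 if gap <= 1 else prev + 2)

    starts = [0] + boundaries
    ends = boundaries + [len(units)]
    return ["".join(units[s:e]) for s, e in zip(starts, ends)]


print(syllabify("makan"))   # ['ma', 'kan']
print(syllabify("bunga"))   # ['bu', 'nga']
print(syllabify("tempat"))  # ['tem', 'pat']
```

A consonant-plus-vowel unit such as `ma` or `nga` roughly corresponds to a single written character in an abugida, which is the intuition behind segmenting at this level.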

Findings

01 Syllable-based tokenization yields a consistent Token per Character (TPC) ratio across languages.

02 It increases token sequence similarity by approximately 21% relative to GPT-2 tokenization.

03 The approach better preserves phonological and morphological patterns.
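As a rough illustration of the TPC metric named in the findings, the sketch below divides the number of tokens by the number of characters in the source text. The function name `tokens_per_character` is hypothetical, and counting every character (including spaces) in the denominator is an assumption; the paper may normalize differently.

```python
def tokens_per_character(tokens, text):
    """Token per Character ratio: token count divided by character count.

    Assumption: all characters of `text`, including whitespace, are counted.
    """
    if not text:
        raise ValueError("text must be non-empty")
    return len(tokens) / len(text)


# Two syllable tokens covering a five-character word.
print(tokens_per_character(["ma", "kan"], "makan"))  # 0.4

# A character-level split of the same word gives the upper bound of 1.0.
print(tokens_per_character(list("makan"), "makan"))  # 1.0
```

A lower TPC means fewer tokens are needed per character; comparing the ratio across parallel sentences shows whether a tokenizer treats languages evenly.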

Abstract

Tokenization constitutes a fundamental stage in Large Language Model (LLM) processing; however, subword-based tokenization methods optimized on English-dominant corpora may produce token fragmentation misaligned with the linguistic structures of Austronesian languages. This study aimed to develop a syllable-based tokenization framework adopting principles from traditional Indonesian scripts (aksara) for regional languages of Indonesia. A syllabic segmentation procedure was constructed based on the logic of abugida writing systems and implemented with a vocabulary of 2,843 tokens extracted from the Indonesian dictionary (KBBI). Evaluation was conducted on the NusaX dataset comprising 1,000 parallel translation samples across 10 regional languages, Indonesian, and English. Analysis employed Token per Character (TPC) ratio and sequence alignment using the Smith-Waterman algorithm. Results…
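The sequence-alignment step can be pictured with a minimal Smith-Waterman sketch over two token lists. The scoring values (match 2, mismatch -1, gap -1) and the function name `smith_waterman_score` are assumptions for illustration; the abstract does not state the parameters used or how the raw score is turned into a similarity figure.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between token sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the score matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            up = prev[j] + gap        # gap in sequence b
            left = curr[j - 1] + gap  # gap in sequence a
            score = max(0, diag, up, left)  # local alignment never drops below 0
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Hypothetical syllable tokens for two related phrases.
s1 = ["sa", "ya", "ma", "kan"]
s2 = ["a", "ku", "ma", "kan"]
print(smith_waterman_score(s1, s2))  # 4: the shared run "ma", "kan"
```

Because the alignment is local, the score rewards the longest shared run of tokens even when the surrounding tokens differ, which suits parallel sentences in related languages.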

Code & Models

Models

ai-toba/toba-trilingual-1.2B (Hugging Face)
model · 4 downloads · 1 like

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods