PhayaThaiBERT: Enhancing a Pretrained Thai Language Model with Unassimilated Loanwords

Panyut Sriwirote; Jalinee Thapiang; Vasan Timtong; Attapol T. Rutherford

arXiv:2311.12475·cs.CL·November 18, 2025·5 cites

TL;DR

PhayaThaiBERT is a Thai language model that improves understanding of foreign words, especially English loanwords, by expanding its vocabulary and pretraining on a larger dataset, leading to better downstream task performance.

Contribution

The paper introduces PhayaThaiBERT, a Thai language model whose vocabulary is expanded with foreign words through vocabulary transfer from XLM-R's tokenizer, followed by additional pretraining from WangchanBERTa's checkpoint, to improve its understanding of unassimilated loanwords.

Findings

01. PhayaThaiBERT outperforms WangchanBERTa in multiple downstream tasks.

02. Vocabulary expansion improves foreign word comprehension.

03. Pretraining on a larger dataset boosts model performance.
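
To illustrate why a missing foreign vocabulary can hurt comprehension, the sketch below tokenizes an English word with and without whole-word pieces in the vocabulary. It is a toy only: the greedy longest-match splitter, the `greedy_tokenize` name, and both vocabularies are hypothetical stand-ins, not the tokenizers used by WangchanBERTa or PhayaThaiBERT.

```python
def greedy_tokenize(word, vocab):
    """Split `word` left to right into the longest pieces found in `vocab`.

    Characters that match no vocabulary entry become single-character pieces.
    This is an illustrative stand-in, not a real subword tokenizer.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: fall back to one character.
            pieces.append(word[i])
            i += 1
    return pieces


# Hypothetical vocabulary with Thai pieces and only single Latin letters.
base_vocab = {"ภาษา", "ไทย", "a", "d", "l", "n", "o", "w"}
# Hypothetical expanded vocabulary that also contains English subwords.
expanded_vocab = base_vocab | {"down", "load"}

word = "download"
base_pieces = greedy_tokenize(word, base_vocab)
expanded_pieces = greedy_tokenize(word, expanded_vocab)

print(len(base_pieces), base_pieces)
print(len(expanded_pieces), expanded_pieces)
```

With the letter-only vocabulary the word breaks into eight single-character pieces, while the expanded vocabulary yields two meaningful subwords, which is the kind of fragmentation the finding above refers to.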

Abstract

While WangchanBERTa has become the de facto standard in transformer-based Thai language modeling, it still has shortcomings in regard to the understanding of foreign words, most notably English words, which are often borrowed without orthographic assimilation into Thai in many contexts. We identify the lack of foreign vocabulary in WangchanBERTa's tokenizer as the main source of these shortcomings. We then expand WangchanBERTa's vocabulary via vocabulary transfer from XLM-R's pretrained tokenizer and pretrain a new model using the expanded tokenizer, starting from WangchanBERTa's checkpoint, on a new dataset that is larger than the one used to train WangchanBERTa. Our results show that our new pretrained model, PhayaThaiBERT, outperforms WangchanBERTa in many downstream tasks and datasets.
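
The vocabulary-transfer step described in the abstract can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function name `transfer_vocabulary`, the toy vocabularies, and in particular the initialization of new embedding rows (mean of existing rows plus small noise) are all hypothetical, since the abstract does not say how new token embeddings are initialized.

```python
import random


def transfer_vocabulary(base_vocab, donor_tokens, embeddings, seed=0):
    """Append donor tokens missing from `base_vocab` and grow the embedding table.

    base_vocab:   dict mapping token -> id (ids 0..n-1)
    donor_tokens: iterable of tokens taken from a donor tokenizer
    embeddings:   list of n rows (lists of floats), one per base token

    ASSUMPTION: each new row is the mean of the existing rows plus small
    Gaussian noise. The source does not specify the initialization scheme.
    """
    rng = random.Random(seed)
    vocab = dict(base_vocab)                # keep the original ids unchanged
    rows = [list(r) for r in embeddings]    # copy so the input is not mutated
    dim = len(rows[0])
    mean = [sum(r[k] for r in rows) / len(rows) for k in range(dim)]

    for tok in donor_tokens:
        if tok in vocab:
            continue                        # already known: nothing to transfer
        vocab[tok] = len(vocab)             # next free id
        rows.append([m + rng.gauss(0.0, 0.01) for m in mean])
    return vocab, rows


# Toy data (hypothetical): a three-token base vocabulary with 2-d embeddings.
base_vocab = {"<s>": 0, "ไทย": 1, "a": 2}
embeddings = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
donor_tokens = ["a", "down", "load", "down"]

new_vocab, new_embeddings = transfer_vocabulary(base_vocab, donor_tokens, embeddings)
print(new_vocab)
print(len(new_embeddings))
```

Existing tokens keep their ids and embedding rows, so training can resume from the base checkpoint, and only the appended rows start untrained. With Hugging Face Transformers, a comparable effect is commonly obtained with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, though whether the authors used those calls is not stated here.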

Code & Models

Repositories

clicknext-ai/phayathaibert (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis