PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation
Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka, Masao Utiyama

TL;DR
PrahokBART is a compact sequence-to-sequence model pre-trained from scratch for Khmer. By incorporating linguistic components such as word segmentation and normalization, it outperforms the multilingual model mBART50 on Khmer natural language generation tasks.
Contribution
The paper introduces PrahokBART, a Khmer-specific pre-trained model with linguistic modules that address challenges particular to Khmer and improve performance on multiple generative tasks.
Findings
PrahokBART outperforms mBART50 on translation, summarization, and headline generation.
Incorporating linguistic components improves Khmer text generation quality.
The analysis examines the impact of each linguistic module and how effectively the model handles spaces during generation, which is crucial for the naturalness of Khmer text.
Abstract
This work introduces PrahokBART, a compact pre-trained sequence-to-sequence model trained from scratch for Khmer using carefully curated Khmer and English corpora. We focus on improving the pre-training corpus quality and addressing the linguistic issues of Khmer, which are ignored in existing multilingual models, by incorporating linguistic components such as word segmentation and normalization. We evaluate PrahokBART on three generative tasks: machine translation, text summarization, and headline generation, where our results demonstrate that it outperforms mBART50, a strong multilingual pre-trained model. Additionally, our analysis provides insights into the impact of each linguistic module and evaluates how effectively our model handles space during text generation, which is crucial for the naturalness of texts in Khmer.
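To make the idea of a normalization step concrete, the following is a minimal Python sketch of the kind of text cleanup that might precede word segmentation for Khmer. It is not the authors' pipeline: the function name normalize_khmer, the choice of NFC, and the decision to strip zero-width spaces (which are often used as invisible word separators in Khmer web text) are all assumptions made for illustration.

```python
import re
import unicodedata

# Characters removed in this sketch (an assumed list, not taken from the paper):
# U+200B zero-width space and U+FEFF byte-order mark / zero-width no-break space.
INVISIBLE = {0x200B: None, 0xFEFF: None}


def normalize_khmer(text: str) -> str:
    """Hypothetical cleanup: NFC-normalize, drop invisible separators,
    and collapse runs of whitespace into a single visible space."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(INVISIBLE)
    return re.sub(r"\s+", " ", text).strip()
```

Visible spaces are kept (only collapsed) because, as the abstract notes, spaces matter for the naturalness of Khmer text; a real system would need a deliberate policy for distinguishing them from word boundaries introduced by segmentation.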
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
