OpenSeal: Good, Fast, and Cheap Construction of an Open-Source Southeast Asian LLM via Parallel Data
Tan Sang Nguyen, Muhammad Reza Qorib, Hwee Tou Ng

TL;DR
OpenSeal is a cost-effective, open-source Southeast Asian LLM built by continual pretraining on parallel data. It shows that parallel data alone can extend an LLM to new languages while performing comparably to existing models.
Contribution
This work introduces OpenSeal, the first truly open-source Southeast Asian LLM, and demonstrates the effectiveness of using only parallel data for multilingual extension.
Findings
Parallel data is highly effective for multilingual LLM extension.
OpenSeal achieves competitive performance with minimal data and compute.
34.7B tokens of parallel data suffice to build a high-quality multilingual model.
Abstract
Large language models (LLMs) have proven to be effective tools for a wide range of natural language processing (NLP) applications. Although many LLMs are multilingual, most remain English-centric and perform poorly on low-resource languages. Recently, several Southeast Asia-focused LLMs have been developed, but none are truly open source, as they do not publicly disclose their training data. Truly open-source models are important for transparency and for enabling a deeper and more precise understanding of LLM internals and development, including biases, generalization, and multilinguality. Motivated by recent advances demonstrating the effectiveness of parallel data in improving multilingual performance, we conduct controlled and comprehensive experiments to study the effectiveness of parallel data in continual pretraining of LLMs. Our findings show that using only parallel data is the…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
