RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining
Anh-Dung Vo, Minseong Jung, Wonbeen Lee, Daewoo Choi

TL;DR
RedWhale is a Korean-specific large language model developed through efficient continual pretraining, leveraging cross-lingual transfer to outperform existing models on Korean NLP benchmarks while reducing training costs.
Contribution
The paper introduces RedWhale, a novel Korean LLM built with an efficient pretraining strategy and cross-lingual transfer, addressing resource constraints and improving Korean NLP performance.
Findings
RedWhale outperforms existing models on Korean benchmarks.
After pretraining on 9.7 billion tokens, the model still shows no signs of convergence.
Efficient strategies reduce training time and computational costs.
Abstract
The field of Natural Language Processing (NLP) has seen significant advancements with the development of Large Language Models (LLMs). However, much of this research remains focused on English, often overlooking low-resource languages like Korean. This oversight presents challenges due to the unique non-alphabetic token structure of Korean and the substantial memory and computational demands required for LLM training, which frequently lead to memory constraints and out-of-memory errors. To address these issues, we present RedWhale, a model specifically tailored for Korean language processing. RedWhale is developed using an efficient continual pretraining approach that includes a comprehensive Korean corpus preprocessing pipeline, a specialized tokenizer, an optimized model initialization technique, and a multistage pretraining strategy. These innovations collectively reduce training…
Taxonomy
Topics: Natural Language Processing Techniques
