RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining

Anh-Dung Vo; Minseong Jung; Wonbeen Lee; Daewoo Choi

arXiv:2408.11294·cs.CL·August 22, 2024

TL;DR

RedWhale is a Korean-specific large language model developed through efficient continual pretraining, leveraging cross-lingual transfer to outperform existing models on Korean NLP benchmarks while reducing training costs.

Contribution

The paper introduces RedWhale, a novel Korean LLM built with an efficient pretraining strategy and cross-lingual transfer, addressing resource constraints and improving Korean NLP performance.

Findings

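1. RedWhale outperforms existing models on Korean NLP benchmarks.

2. After pretraining on 9.7 billion tokens, the model still shows no signs of convergence, which suggests that further training could yield additional gains.

3. The efficient pretraining strategy reduces training time and computational cost.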

Abstract

The field of Natural Language Processing (NLP) has seen significant advancements with the development of Large Language Models (LLMs). However, much of this research remains focused on English, often overlooking low-resource languages like Korean. This oversight presents challenges due to the unique non-alphabetic token structure of Korean and the substantial memory and computational demands required for LLM training, which frequently lead to memory constraints and out-of-memory errors. To address these issues, we present RedWhale, a model specifically tailored for Korean language processing. RedWhale is developed using an efficient continual pretraining approach that includes a comprehensive Korean corpus preprocessing pipeline, a specialized tokenizer, an optimized model initialization technique, and a multistage pretraining strategy. These innovations collectively reduce training…

Taxonomy

Topics: Natural Language Processing Techniques