Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean
ChangSu Choi, Yongbin Jeong, Seoyoon Park, InHo Won, HyeonSeok Lim, SangMin Kim, Yejee Kang, Chanhyuk Yoon, Jaewan Park, Yiseul Lee, HyeJin Lee, Younggyun Hahm, Hansaem Kim, KyungTae Lim

TL;DR
This paper presents three strategies for improving multilingual large language models on less-resourced languages, demonstrated on Korean: vocabulary expansion, bilingual pretraining, and instruction-tuning on a small, high-quality dataset. The resulting model, Bllossom, outperformed previous Korean models in qualitative assessments.
Contribution
The study introduces a comprehensive approach combining vocabulary expansion, bilingual pretraining, and instruction-tuning to enhance LLM performance for less-resourced languages like Korean.
Findings
Bllossom outperformed previous Korean models in qualitative assessments.
Vocabulary expansion improved language expressiveness.
Instruction-tuning with high-quality data enhanced task performance.
Abstract
Large language models (LLMs) use pretraining to predict the subsequent word; however, their expansion requires significant computing resources. Numerous big tech companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demands, overlooking less-resourced languages (LRLs). This study proposed three strategies to enhance the performance of LRLs based on the publicly available MLLMs. First, the MLLM vocabularies of LRLs were expanded to enhance expressiveness. Second, bilingual data were used for pretraining to align the high- and less-resourced languages. Third, a high-quality small-scale instruction dataset was constructed and instruction-tuning was performed to augment the LRL. The experiments employed the Llama2 model and Korean was used as the LRL, which was quantitatively evaluated against other developed LLMs across eight tasks. Furthermore, a…
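The first strategy, vocabulary expansion, can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a greedy longest-match tokenizer that falls back to one token per UTF-8 byte when no vocabulary entry matches, and the function `tokenize` and both vocabularies are hypothetical. It only shows why adding Korean entries to a vocabulary shortens the token sequence for Korean text.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization with UTF-8 byte fallback (toy sketch)."""
    max_len = max((len(t) for t in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No entry matches: emit one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabularies: the base one has no Korean entries at all.
base_vocab = {"Hello", " ", "world"}
expanded_vocab = base_vocab | {"안녕", "하세요", "세계"}

sentence = "안녕하세요 세계"  # "Hello world"
base_tokens = tokenize(sentence, base_vocab)
expanded_tokens = tokenize(sentence, expanded_vocab)

print("base vocabulary:    ", len(base_tokens), "tokens")
print("expanded vocabulary:", len(expanded_tokens), "tokens")
print(expanded_tokens)
```

With the base vocabulary every Korean syllable is split into several byte tokens, while the expanded vocabulary covers the same sentence with a handful of word-level tokens. In a real model, each new vocabulary entry also needs a new embedding row, which has to be learned during further pretraining.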
Code & Models
- 🤗 MLP-KTLim/llama-3-Korean-Bllossom-8B · model · 4.0k dl · ♡ 388
- 🤗 Bllossom/llama-3-Korean-Bllossom-70B · model · 36 dl · ♡ 92
- 🤗 Bllossom/llama-3-Korean-Bllossom-70B-gguf-Q4_K_M · model · 103 dl · ♡ 54
- 🤗 MLP-KTLim/llama-3-Korean-Bllossom-8B-gguf-Q4_K_M · model · 925 dl · ♡ 84
- 🤗 kfkas/Hansung-Bllossom-8B · model · 1 dl
- 🤗 QuantFactory/llama-3-Korean-Bllossom-8B-GGUF · model · 159 dl · ♡ 3
- 🤗 kfkas/Hansung-Llama-3-8B · model · 1 dl
- 🤗 iknow-lab/ko-genstruct-v0.1 · model · 4 dl · ♡ 2
- 🤗 leesm/llama-3-Korean-Bllossom-8B-trexlab-oki10p · model · 4 dl
- 🤗 Bllossom/llama-3.1-Korean-Bllossom-405B · model · 9 dl · ♡ 55
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: ALIGN
