Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean

ChangSu Choi; Yongbin Jeong; Seoyoon Park; InHo Won; HyeonSeok Lim; SangMin Kim; Yejee Kang; Chanhyuk Yoon; Jaewan Park; Yiseul Lee; HyeJin Lee; Younggyun Hahm; Hansaem Kim; KyungTae Lim

arXiv:2403.10882 · cs.CL · March 22, 2024 · 1 citation

TL;DR

This paper presents three strategies for improving multilingual large language models on less-resourced languages, demonstrated on Korean: vocabulary expansion, bilingual pretraining, and instruction-tuning. The resulting model is reported to outperform previous Korean models.

Contribution

The study introduces a comprehensive approach combining vocabulary expansion, bilingual pretraining, and instruction-tuning to enhance LLM performance for less-resourced languages like Korean.

Findings

1. Bllossom outperformed previous Korean models in qualitative assessments.

2. Vocabulary expansion improved language expressiveness.

3. Instruction-tuning with high-quality data enhanced task performance.

Abstract

Large language models (LLMs) use pretraining to predict the subsequent word; however, their expansion requires significant computing resources. Numerous big tech companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demands, overlooking less-resourced languages (LRLs). This study proposed three strategies to enhance the performance of LRLs based on the publicly available MLLMs. First, the MLLM vocabularies of LRLs were expanded to enhance expressiveness. Second, bilingual data were used for pretraining to align the high- and less-resourced languages. Third, a high-quality small-scale instruction dataset was constructed and instruction-tuning was performed to augment the LRL. The experiments employed the Llama2 model and Korean was used as the LRL, which was quantitatively evaluated against other developed LLMs across eight tasks. Furthermore, a…
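
The first strategy, vocabulary expansion, can be illustrated with a toy sketch. It is not the authors' implementation: the greedy tokenizer, the `expand_vocab` helper, the two-dimensional embeddings, and the choice to initialize each new token's embedding as the mean of its old sub-token embeddings are all assumptions made for illustration. Mean initialization is a common heuristic, and the abstract does not say which initialization the paper uses.

```python
from typing import Dict, List, Tuple


def greedy_tokenize(text: str, vocab: Dict[str, int], unk: str = "<unk>") -> List[str]:
    """Longest-match-first tokenization over a toy vocabulary (hypothetical helper)."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No match at all: emit the unknown token and skip one character.
            tokens.append(unk)
            i += 1
    return tokens


def expand_vocab(
    vocab: Dict[str, int],
    embeddings: List[List[float]],
    new_tokens: List[str],
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Return copies of vocab and embeddings with new_tokens appended.

    Assumption: each new row is the mean of the embeddings of the pieces
    the OLD vocabulary would have split the token into.
    """
    new_vocab = dict(vocab)
    new_emb = [row[:] for row in embeddings]
    dim = len(embeddings[0])
    for tok in new_tokens:
        if tok in new_vocab:
            continue  # already present, nothing to add
        pieces = greedy_tokenize(tok, vocab)
        rows = [embeddings[vocab[p]] for p in pieces]
        mean = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
        new_vocab[tok] = len(new_emb)
        new_emb.append(mean)
    return new_vocab, new_emb


# Toy data: a character-level vocabulary and 2-dimensional embeddings.
vocab = {"<unk>": 0, "한": 1, "국": 2, "어": 3}
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

new_vocab, new_emb = expand_vocab(vocab, embeddings, ["한국어"])
before = greedy_tokenize("한국어", vocab)      # three single-character tokens
after = greedy_tokenize("한국어", new_vocab)   # one whole-word token
```

In this toy setting the word for "Korean language" goes from three tokens to one, which is the kind of efficiency gain vocabulary expansion aims for. In a real model the new embedding rows would then be trained further, for example during the bilingual pretraining step the abstract describes.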

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: ALIGN