GECKO: Generative Language Model for English, Code and Korean

Sungwoo Oh; Donggyu Kim

arXiv:2405.15640·cs.CL·May 27, 2024

TL;DR

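GECKO is a bilingual large language model optimized for Korean and English, along with programming languages. It tokenizes Korean and English efficiently despite a small vocabulary, and it performs strongly on KMMLU and modestly on English and code benchmarks while being trained on fewer tokens than English-focused LLMs.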

Contribution

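This work introduces GECKO, a bilingual LLM pretrained on a balanced Korean-English corpus using the LLaMA architecture. The report shares lessons learned from building the data pipeline and training the model, and the model is released to the open-source community under a permissive license.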

Findings

01. High efficiency in token generation for Korean and English
02. Strong performance on the KMMLU (Korean MMLU) benchmark
03. Modest performance on English and code benchmarks

Abstract

We introduce GECKO, a bilingual large language model (LLM) optimized for Korean and English, along with programming languages. GECKO is pretrained on a balanced, high-quality corpus of Korean and English using the LLaMA architecture. In this report, we share our experience from several efforts to build a better data pipeline for the corpus and to train our model. GECKO shows great efficiency in token generation for both Korean and English, despite its small vocabulary size. We measure its performance on representative benchmarks for Korean, English and Code: it exhibits great performance on KMMLU (Korean MMLU) and modest performance in English and Code, even with fewer trained tokens than English-focused LLMs. GECKO is available to the open-source community under a permissive license. We hope our work offers a research baseline and practical…
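The abstract's claim about token efficiency can be made concrete with a simple measurement. The sketch below is not the authors' evaluation: it assumes "efficiency" means the average number of tokens produced per character (lower is better), and the two tokenizers, `char_tokenizer` and `whitespace_tokenizer`, are hypothetical stand-ins used only so the example runs without downloading a model.

```python
from typing import Callable, Iterable, List


def tokens_per_char(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens emitted per character (lower means more efficient).

    Assumption: this ratio is one plausible efficiency metric, not necessarily
    the one used in the GECKO report.
    """
    total_tokens = 0
    total_chars = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_chars += len(text)
    if total_chars == 0:
        raise ValueError("texts must contain at least one character")
    return total_tokens / total_chars


def char_tokenizer(text: str) -> List[str]:
    """Hypothetical worst-case baseline: one token per character."""
    return list(text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: split on whitespace."""
    return text.split()


# Small mixed sample covering Korean, English, and code.
samples = ["안녕하세요 세계", "hello world", "def add(a, b): return a + b"]

for name, tok in [("char", char_tokenizer), ("whitespace", whitespace_tokenizer)]:
    print(f"{name}: {tokens_per_char(tok, samples):.3f} tokens/char")
```

To compare real models, pass any function that maps a string to a list of tokens in place of the stand-ins, and run it over separate Korean, English, and code samples so the per-language ratios can be read side by side.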

Code & Models

Models

kifai/GECKO-7B (Hugging Face model; 78 downloads, 14 likes)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: LLaMA