DNA 1.0 Technical Report

Jungyup Lee; Jemin Kim; Sang Park; SeungJae Lee

arXiv:2501.10648·cs.CL·January 22, 2025

TL;DR

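DNA 1.0 8B Instruct is a bilingual Korean-English language model built on Llama 3.1 8B through continual pre-training, supervised fine-tuning, SLERP merging with Llama 3.1 8B Instruct, direct preference optimization, and knowledge distillation. It reports state-of-the-art results on Korean benchmarks while keeping strong English performance, and it is openly available.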

Contribution

The paper introduces DNA 1.0 8B Instruct, a bilingual model that combines continual pre-training (CPT), supervised fine-tuning (SFT), SLERP model merging, direct preference optimization (DPO), and knowledge distillation (KD) to strengthen Korean capabilities while preserving English ones.

Findings

01. State-of-the-art Korean task performance (e.g., KMMLU 53.26%)

02. Strong English task results (e.g., MMLU 66.64%)

03. Open availability of the model

Abstract

In this report, we present DNA 1.0 8B Instruct, a state-of-the-art bilingual language model optimized for Korean and English language tasks. By applying continual pre-training (CPT) with high-quality Korean datasets to Llama 3.1 8B and subsequent supervised fine-tuning (SFT), we create an instruction-following model with enhanced Korean language capabilities. This model is then merged with Llama 3.1 8B Instruct via spherical linear interpolation (SLERP) and undergoes further optimization through direct preference optimization (DPO) and knowledge distillation (KD). DNA 1.0 8B Instruct achieves state-of-the-art results on Korean-specific tasks, including KMMLU (53.26%), KoBEST (83.40%), and BELEBELE (57.99%), while maintaining strong English capabilities on MMLU (66.64%), MMLU-Pro (43.05%) and GSM8K (80.52%). As an open model, DNA 1.0 8B Instruct represents a significant advancement in…
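The merging step named in the abstract, spherical linear interpolation (SLERP), can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `slerp`, the flat-list representation of weights, the `eps` threshold, and the interpolation factor `t = 0.5` are all assumptions made for illustration, and the report excerpt does not state the factor or how individual layers are handled.

```python
import math


def slerp(v0, v1, t, eps=1e-8):
    """Spherically interpolate between two equal-length weight vectors.

    Hypothetical helper for illustration: v0 and v1 stand in for the
    flattened weights of one tensor from each model being merged.
    """
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))

    # Degenerate case: a zero vector has no direction, so fall back
    # to plain linear interpolation.
    if norm0 < eps or norm1 < eps:
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]

    # Angle between the two vectors, clamped for numerical safety.
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)

    # Nearly parallel (or anti-parallel) vectors: SLERP is ill-conditioned,
    # so fall back to linear interpolation.
    if abs(sin_omega) < eps:
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]

    w0 = math.sin((1 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


# Toy usage: halfway between two orthogonal unit vectors.
merged = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print(merged)  # both components are about 0.7071
```

Unlike a plain weighted average, the interpolated vector in this toy case keeps unit length, which is the usual motivation for preferring SLERP when merging weights from two related models.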

Taxonomy

Topics: Korean-English bilingual language models

Methods: LLaMA · Knowledge Distillation