LLaMA Beyond English: An Empirical Study on Language Capability Transfer

Jun Zhao; Zhihao Zhang; Luhui Gao; Qi Zhang; Tao Gui; Xuanjing Huang

arXiv:2401.01055·cs.CL·January 15, 2024·6 cites

TL;DR

This paper empirically investigates how to transfer LLaMA's language capabilities to non-English languages, analyzing factors like vocabulary extension and instruction tuning, and demonstrating effective transfer with minimal pretraining data across multiple languages.

Contribution

It provides a comprehensive empirical study on transferring LLaMA's capabilities to non-English languages, highlighting effective methods and minimal data requirements.

Findings

01 Comparable performance achieved with less than 1% of the pretraining data

02 Effective transfer demonstrated across 13 low-resource languages

03 Evaluation shows maintained response quality and knowledge alignment

Abstract

In recent times, substantial advancements have been witnessed in large language models (LLMs), exemplified by ChatGPT, showcasing remarkable proficiency across a range of complex tasks. However, many mainstream LLMs (e.g., LLaMA) are pretrained on an English-dominant corpus, which limits their performance in non-English languages. In this paper, we focus on how to effectively transfer the capabilities of language generation and instruction following to a non-English language. To answer this question, we conduct an extensive empirical investigation based on LLaMA, accumulating over 1440 GPU hours. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. To accurately assess the model's level of knowledge, we employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench. Furthermore,…

Code & Models

Datasets

desik98/UniversallyJailbreakingLLMInputOutputSafetyFilters
dataset· 168 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

MethodsFocus