Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca

Yiming Cui; Ziqing Yang; Xin Yao

arXiv:2304.08177·cs.CL·February 26, 2024·72 cites

Open Access · 5 Repos · 8 Models · 1 Dataset

TL;DR

This paper presents a method to adapt LLaMA for Chinese language understanding and generation by expanding its vocabulary, performing additional pre-training, and applying instruction fine-tuning, resulting in improved performance on Chinese NLP tasks.

Contribution

The authors extend LLaMA's vocabulary with 20,000 additional Chinese tokens, perform secondary pre-training, and fine-tune with Chinese instructions, so that the model can understand and generate Chinese text and follow instructions.

Findings

01

Enhanced Chinese understanding and generation in LLaMA.

02

Competitive performance on Chinese NLP benchmarks.

03

Open-sourced models and resources for community use.

Abstract

Large Language Models (LLMs), such as ChatGPT and GPT-4, have dramatically transformed natural language processing research and shown promising strides towards Artificial General Intelligence (AGI). Nonetheless, the high costs associated with training and deploying LLMs present substantial obstacles to transparent, accessible academic research. While several large language models, such as LLaMA, have been open-sourced by the community, these predominantly focus on English corpora, limiting their usefulness for other languages. In this paper, we propose a method to augment LLaMA with capabilities for understanding and generating Chinese text and its ability to follow instructions. We achieve this by extending LLaMA's existing vocabulary with an additional 20,000 Chinese tokens, thereby improving its encoding efficiency and semantic understanding of Chinese. We further incorporate…
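The encoding-efficiency claim in the abstract can be illustrated with a toy example. The sketch below is not the authors' tokenizer or SentencePiece. It is a hypothetical greedy longest-match tokenizer (`encode`, `base_vocab`, and `extended_vocab` are names invented here) that assumes text missing from the vocabulary falls back to one token per UTF-8 byte. Under that assumption, adding Chinese entries to the vocabulary shortens the token sequence for the same text.

```python
def encode(text, vocab):
    """Toy greedy longest-match tokenizer with UTF-8 byte fallback.

    Illustrative only: real subword tokenizers use learned merges or
    unigram scores rather than plain longest-match lookup.
    """
    tokens = []
    max_len = max((len(t) for t in vocab), default=1)
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                tokens.append(piece)
                i += n
                break
        else:
            # No entry matched: emit one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabularies: an English-only base and a Chinese-extended one.
base_vocab = {"hello", " ", "world"}
extended_vocab = base_vocab | {"你好", "世界"}

text = "你好世界"
before = encode(text, base_vocab)      # every character becomes 3 byte tokens
after = encode(text, extended_vocab)   # each word is a single token

print(len(before), before)
print(len(after), after)
```

In this toy setting the four-character string costs 12 byte tokens with the base vocabulary and 2 tokens with the extended one, which is the kind of sequence-length reduction that vocabulary extension aims for.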

Code & Models

Repositories

Models

Datasets

INX-TEXT/Bailong-bench
dataset · 16 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Attention Is All You Need · Label Smoothing · Dropout · Residual Connection · Softmax · Byte Pair Encoding