Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
Yiming Cui, Ziqing Yang, Xin Yao

TL;DR
This paper presents a method to adapt LLaMA for Chinese language understanding and generation by expanding its vocabulary, performing additional pre-training, and applying instruction fine-tuning, which improves performance on Chinese NLP tasks.
Contribution
The authors extend LLaMA with Chinese tokens, perform secondary pre-training, and fine-tune with Chinese instructions, enabling effective Chinese language capabilities.
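As a rough illustration of what extending a vocabulary involves, the sketch below appends new tokens after the existing ones so that original token IDs stay unchanged, then grows the embedding table to match. The function names, toy sizes, and the random initialization of the new rows are assumptions for illustration, not the authors' implementation.

```python
import random


def merge_vocab(base_tokens, new_tokens):
    """Append unseen tokens to the end so existing token IDs are preserved."""
    merged = list(base_tokens)
    seen = set(base_tokens)
    for tok in new_tokens:
        if tok not in seen:
            merged.append(tok)
            seen.add(tok)
    return merged


def resize_embeddings(embeddings, new_size, rng):
    """Return a table with new_size rows; existing rows are reused as-is.

    New rows are drawn from a small Gaussian (a hypothetical choice).
    """
    extra = new_size - len(embeddings)
    if extra < 0:
        raise ValueError("new_size is smaller than the current table")
    dim = len(embeddings[0])
    new_rows = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(extra)]
    return embeddings + new_rows


# Toy example: a 4-token base vocabulary with 8-dimensional embeddings.
rng = random.Random(0)
base_tokens = ["<unk>", "a", "b", "你"]
chinese_tokens = ["你", "好", "世界"]  # "你" already exists and is skipped
base_embeddings = [[rng.gauss(0.0, 0.02) for _ in range(8)] for _ in base_tokens]

merged_tokens = merge_vocab(base_tokens, chinese_tokens)
merged_embeddings = resize_embeddings(base_embeddings, len(merged_tokens), rng)
```

Appending rather than re-indexing keeps the pretrained rows aligned with their original tokens, so only the newly added rows start untrained and must be learned during the additional pre-training.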
Findings
Enhanced Chinese understanding and generation in LLaMA.
Competitive performance on Chinese NLP benchmarks.
Open-sourced models and resources for community use.
Abstract
Large Language Models (LLMs), such as ChatGPT and GPT-4, have dramatically transformed natural language processing research and shown promising strides towards Artificial General Intelligence (AGI). Nonetheless, the high costs associated with training and deploying LLMs present substantial obstacles to transparent, accessible academic research. While several large language models, such as LLaMA, have been open-sourced by the community, these predominantly focus on English corpora, limiting their usefulness for other languages. In this paper, we propose a method to augment LLaMA with capabilities for understanding and generating Chinese text and its ability to follow instructions. We achieve this by extending LLaMA's existing vocabulary with an additional 20,000 Chinese tokens, thereby improving its encoding efficiency and semantic understanding of Chinese. We further incorporate…
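To make the encoding-efficiency point concrete, here is a minimal sketch that counts tokens for a short Chinese string with and without dedicated Chinese tokens. It assumes a toy greedy longest-match tokenizer that falls back to one token per UTF-8 byte for unknown characters; this is a simplification for illustration, not the tokenizer used in the paper, and the vocabularies are hypothetical.

```python
def count_tokens(text, vocab):
    """Count tokens using greedy longest-match with UTF-8 byte fallback."""
    max_len = max((len(tok) for tok in vocab), default=1)
    count = 0
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in vocab:
                count += 1
                i += length
                break
        else:
            # Unknown character: one token per UTF-8 byte.
            count += len(text[i].encode("utf-8"))
            i += 1
    return count


text = "你好世界"
base_vocab = set()                            # toy vocabulary with no Chinese tokens
extended_vocab = {"你", "好", "你好", "世界"}  # hypothetical added tokens

base_count = count_tokens(text, base_vocab)
extended_count = count_tokens(text, extended_vocab)
print(base_count, extended_count)
```

Each of these four characters occupies three bytes in UTF-8, so the byte fallback spends 12 tokens where the extended vocabulary needs 2. Shorter sequences mean more text fits in a fixed context window and less computation per sentence.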
Code & Models
- 🤗 rabitt/Chinese-Alpaca-Plus-13B-GPTQ · model · 2 dl · ♡ 3
- 🤗 neukg/TechGPT-2.0-alpaca-hf · model · 8 dl · ♡ 1
- 🤗 neukg/TechGPT-2.0-atom-hf · model · 7 dl · ♡ 3
- 🤗 neukg/TechGPT-2.0-QLora-hf · model · ♡ 2
- 🤗 INX-TEXT/Bailong-instruct-7B · model · ♡ 46
- 🤗 NeroUCH/Bailong-instruct-7B-GGUF · model · 68 dl · ♡ 9
- 🤗 INX-TEXT/Bailong-orpo-7B · model · 19 dl · ♡ 5
- 🤗 neukg/TechGPT-2.0-Qwen1.5-7b · model · 1 dl · ♡ 1
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Attention Is All You Need · Label Smoothing · Dropout · Residual Connection · Softmax · Byte Pair Encoding
