CLUECorpus2020: A Large-scale Chinese Corpus for Pre-training Language Model

Liang Xu; Xuanwei Zhang; Qianqian Dong

arXiv:2003.01355 · cs.CL · March 6, 2020 · 34 citations

TL;DR

This paper introduces CLUECorpus2020, a large-scale Chinese corpus for pre-training language models, along with new vocabulary and models that improve efficiency and performance for Chinese NLP tasks.

Contribution

It provides a massive Chinese corpus, a compact vocabulary, and pre-trained models that enhance efficiency and achieve state-of-the-art results in Chinese language understanding.

Findings

01

Models trained on CLUECorpus2020 achieve excellent Chinese understanding performance.

02

The new 8K vocabulary reduces computational costs while maintaining accuracy.

03

Pre-trained models include a large version with state-of-the-art results and a tiny version for faster inference.

Abstract

In this paper, we introduce CLUECorpus2020, a large-scale Chinese corpus from the CLUE organization that can be used directly for self-supervised learning, such as pre-training a language model or language generation. It contains 100 GB of raw text with 35 billion Chinese characters, retrieved from Common Crawl. To better understand this corpus, we conduct language understanding experiments at both small and large scale, and the results show that models trained on this corpus achieve excellent performance on Chinese. We release a new Chinese vocabulary with a size of 8K, only one-third of the vocabulary size used in the Chinese BERT released by Google. It saves computational cost and memory while working as well as the original vocabulary. We also release both large and tiny versions of the model pre-trained on this corpus. The former achieves the state-of-the-art result,…
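To make the cost claim about the smaller vocabulary concrete, the sketch below estimates how many token-embedding weights an 8K vocabulary saves compared with the 21,128-entry vocabulary of Google's Chinese BERT. It is a back-of-the-envelope illustration, not the authors' accounting: the value 8000 is a rounding of "8K", the hidden size 768 is the usual BERT-base setting, the hidden size 312 is only inferred from the released model name `roberta_chinese_3L312_clue_tiny`, and the function names are hypothetical.

```python
# Rough estimate of token-embedding parameters for two vocabulary sizes.
# Assumptions (not stated on this page): 8K is rounded to 8000, and the
# hidden sizes 768 (BERT-base) and 312 (guessed from "3L312") are illustrative.

GOOGLE_CHINESE_BERT_VOCAB = 21128  # vocabulary size of Google's Chinese BERT
CLUE_VOCAB_APPROX = 8000           # assumption: "8K" taken as exactly 8000


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Weights in a token-embedding matrix of shape (vocab_size, hidden_size)."""
    return vocab_size * hidden_size


def embedding_savings(hidden_size: int) -> tuple[int, float]:
    """Return (weights saved, fraction saved) when switching to the 8K vocabulary."""
    old = embedding_params(GOOGLE_CHINESE_BERT_VOCAB, hidden_size)
    new = embedding_params(CLUE_VOCAB_APPROX, hidden_size)
    return old - new, 1 - new / old


if __name__ == "__main__":
    for hidden in (768, 312):
        saved, fraction = embedding_savings(hidden)
        print(f"hidden={hidden}: {saved:,} fewer embedding weights ({fraction:.1%} saved)")
```

The fraction saved does not depend on the hidden size, since both matrices scale with it equally. In models that tie the output projection to the embedding matrix, the same reduction would also shrink the softmax over the vocabulary, which is one plausible source of the compute savings the abstract mentions.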

Code & Models

Models

clue/roberta_chinese_3L312_clue_tiny (model · 3 downloads · 2 likes)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece