PanGu-$\alpha$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation

Wei Zeng; Xiaozhe Ren; Teng Su; Hui Wang; Yi Liao; Zhiwei Wang; Xin Jiang; ZhenZhang Yang; Kaisheng Wang; Xiaoda Zhang; Chen Li; Ziyan Gong; Yifan Yao; Xinjing Huang; Jun Wang; Jianfeng Yu; Qi Guo; Yue Yu; Yan Zhang; Jin Wang; Hengtao Tao; Dasen Yan; Zexuan Yi; Fang Peng; Fangqing Jiang; Han Zhang; Lingfeng Deng; Yehong Zhang; Zhe Lin; Chao Zhang; Shaojie Zhang; Mingyue Guo; Shanzhi Gu; Gaojun Fan; Yaowei Wang; Xuefeng Jin; Qun Liu; Yonghong Tian

arXiv:2104.12369·cs.CL·April 27, 2021·94 cites

TL;DR

This paper introduces PanGu-$\alpha$, a family of large-scale autoregressive Chinese language models with up to 200 billion parameters, trained on extensive data across a cluster of 2048 Ascend 910 AI processors using MindSpore Auto-parallel, and reports strong few-shot and zero-shot NLP performance.

Contribution

The paper presents the development and training of PanGu-$\alpha$, a Chinese language model with up to 200 billion parameters, using a training parallelism strategy built on MindSpore Auto-parallel and run on a cluster of 2048 Ascend 910 AI processors.

Findings

01 PanGu-$\alpha$ performs strongly on NLP tasks in few-shot and zero-shot settings.

02 The model with up to 200 billion parameters is trained efficiently using MindSpore Auto-parallel.

03 The model generalizes well across diverse Chinese NLP tasks.

Abstract

Large-scale Pretrained Language Models (PLMs) have become the new paradigm for Natural Language Processing (NLP). PLMs with hundreds of billions of parameters, such as GPT-3, have demonstrated strong performance on natural language understanding and generation with few-shot in-context learning. In this work, we present our practice in training large-scale autoregressive language models named PanGu-$\alpha$, with up to 200 billion parameters. PanGu-$\alpha$ is developed under MindSpore and trained on a cluster of 2048 Ascend 910 AI processors. The training parallelism strategy is implemented based on MindSpore Auto-parallel, which composes five parallelism dimensions to scale the training task to 2048 processors efficiently: data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism, and rematerialization. To enhance the…
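
As a rough illustration of how the parallelism dimensions named in the abstract might compose, the sketch below enumerates ways to split a processor count across data, op-level model, and pipeline parallelism. It assumes that only these three dimensions partition the devices, so their degrees multiply to the processor count, while optimizer model parallelism and rematerialization reduce memory without changing that count. The function name `device_layouts` and the candidate degrees are hypothetical, chosen only for illustration; they are not taken from the paper or from the MindSpore API.

```python
from itertools import product


def device_layouts(num_devices, op_choices=(1, 2, 4, 8),
                   pipeline_choices=(1, 2, 4, 8, 16)):
    """Enumerate (data, op, pipeline) degrees whose product is num_devices.

    op_choices and pipeline_choices are illustrative candidate degrees,
    not values reported in the paper.
    """
    layouts = []
    for op, pipe in product(op_choices, pipeline_choices):
        model_devices = op * pipe  # devices holding one model replica
        if num_devices % model_devices == 0:
            # The remaining factor is the number of data-parallel replicas.
            layouts.append((num_devices // model_devices, op, pipe))
    return layouts


if __name__ == "__main__":
    for data, op, pipe in device_layouts(2048):
        print(f"data={data:4d}  op-level={op}  pipeline={pipe:2d}")
```

Every layout printed satisfies data × op-level × pipeline = 2048. Which layout is actually preferable depends on interconnect bandwidth and per-device memory, which this sketch does not model.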

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Transformer · PanGu-$\alpha$ · Linear Layer · Multi-Head Attention · Byte Pair Encoding