PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity Compensation

Yunhe Wang; Hanting Chen; Yehui Tang; Tianyu Guo; Kai Han; Ying Nie; Xutao Wang; Hailin Hu; Zheyuan Bai; Yun Wang; Fangcheng Liu; Zhicheng Liu; Jianyuan Guo; Sinan Zeng; Yinchen Zhang; Qinghua Xu; Qun Liu; Jun Yao; Chao Xu; Dacheng Tao

arXiv:2312.17276·cs.CL·January 1, 2024·1 cites

Open Access

TL;DR

This paper introduces PanGu-$\pi$, a new language model architecture that enhances nonlinearity to improve performance and efficiency, demonstrating competitive results and practical deployment in finance and law domains.

Contribution

The paper proposes a novel architecture for LLMs that emphasizes nonlinearity enhancement, addressing feature collapse and achieving improved accuracy and speed.

Findings

01

PanGu-$\pi$-7B achieves performance comparable to benchmark models with about 10% faster inference.

02

PanGu-$\pi$-1B attains state-of-the-art accuracy and efficiency.

03

YunShan, a practical application of PanGu-$\pi$-7B, surpasses models of similar scale on benchmarks.

Abstract

The recent trend in large language models (LLMs) is to scale up both model size (i.e., the number of parameters) and dataset size to achieve better generative ability, as demonstrated by many works such as GPT and Llama. However, large models often incur massive computational costs that practical applications cannot afford. Moreover, methods for constructing a strong model architecture for LLMs are rarely discussed. We first analyze the state-of-the-art language model architectures and observe the feature collapse problem. Based on the theoretical analysis, we propose that nonlinearity, which is usually studied in convolutional neural networks for vision tasks, is also very important for language models. The series informed activation function is then introduced with negligible additional computation, and an augmented shortcut is…
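The abstract names a series informed activation function without defining it here. As a rough illustration only, the sketch below assumes the common reading of a "series" activation as a weighted sum of shifted copies of a base activation, $\sum_i a_i\,\sigma(x + b_i)$. The function names, the choice of GELU as the base, and the coefficient values are hypothetical and are not taken from the paper.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU, computed with the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def series_activation(x, scales, shifts, base=gelu):
    """Return sum_i scales[i] * base(x + shifts[i]).

    Assumption: this weighted-sum form is an illustrative guess at what a
    'series' activation looks like, not the paper's exact definition.
    """
    if len(scales) != len(shifts):
        raise ValueError("scales and shifts must have the same length")
    return sum(a * base(x + b) for a, b in zip(scales, shifts))


# With one term (scale 1, shift 0) the series reduces to the base activation.
single = series_activation(0.5, [1.0], [0.0])

# Hypothetical three-term configuration; in a real model these would be learned.
value = series_activation(0.5, [0.5, 1.0, 0.5], [-1.0, 0.0, 1.0])
```

Each extra term costs one more elementwise activation and one multiply-add per input, which is small next to the matrix multiplications in a Transformer block.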

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dropout · Softmax · Cosine Annealing · Adam · Discriminative Fine-Tuning