FPM: A Collection of Large-scale Foundation Pre-trained Language Models
Dezhou Shen

TL;DR
This paper introduces a comprehensive collection of large-scale transformer-based language models, exploring architecture optimization, scaling strategies, and providing the largest Chinese models, to establish strong baselines for future NLP research.
Contribution
It presents a unified set of large-scale transformer models with optimized depth and training techniques, including the largest Chinese language models, to serve as new benchmarks.
Findings
Scaling transformer architectures improves performance when avoiding training defects.
Optimal number of layers depends on specific tasks and models.
Largest Chinese generative and encoding models are introduced.
Abstract
Large-scale Transformer models have significantly promoted the recent development of natural language processing applications. However, little effort has been made to unify the effective models. In this paper, with the aim of providing a new set of baseline models for future work, we adopt various novel transformer architectures and launch a model set with the help of recent mainstream technologies. We focus the discussion on optimizing the depth of the networks based on existing powerful encoder-decoder structures. We show that by properly avoiding training defects such as non-convergence and degradation, scaling up off-the-shelf transformer architectures consistently delivers better performance. To stimulate future research on large-scale language model pretraining, we present extensive results and detailed discussions on network performance improvements with respect to the network depth…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · Residual Connection · WordPiece · Dense Connections
