The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures
Sushant Singh, Ausif Mahmood

TL;DR
This paper reviews recent advancements in Transformer-based NLP models, focusing on their architectures, efficiency improvements, and future research directions, highlighting the balance between performance and computational costs.
Contribution
It provides a comprehensive summary and taxonomy of state-of-the-art NLP models, analyzing their architectures, efficiencies, and potential future developments.
Findings
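Transformer-based models such as BERT and GPT have achieved unprecedented performance across NLP tasks, but at high computational cost.
Recent models use transfer learning, pruning, quantization, and knowledge distillation to reduce model size while keeping performance close to that of their predecessors.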
Efforts are ongoing to improve inference efficiency and handle longer sequences.
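As a rough illustration of the distillation idea mentioned above, the sketch below computes a soft-target loss between a teacher's and a student's output logits. It is a minimal, generic example and not the method of any specific model in the paper: the function names (`softmax`, `distillation_loss`), the temperature value, and the choice of a temperature-scaled KL divergence are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is a common convention (assumed here) that keeps the
    loss magnitude comparable across different temperatures.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )
    return (temperature ** 2) * kl


# Hypothetical logits for a 3-class problem.
teacher = [4.0, 1.0, 0.5]
good_student = [3.5, 1.2, 0.4]
poor_student = [0.5, 1.0, 4.0]

loss_good = distillation_loss(good_student, teacher)
loss_poor = distillation_loss(poor_student, teacher)
```

In this sketch a student whose logits track the teacher's receives a smaller loss than one that ranks the classes differently. In practice this term is usually combined with the ordinary supervised loss on the true labels.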
Abstract
In recent years, Natural Language Processing (NLP) models have achieved phenomenal success in linguistic and semantic tasks such as text classification, machine translation, cognitive dialogue systems, information retrieval via Natural Language Understanding (NLU), and Natural Language Generation (NLG). This feat is primarily attributed to the seminal Transformer architecture, which led to designs such as BERT, GPT (I, II, III), etc. Although these large models have achieved unprecedented performance, they come at high computational cost. Consequently, some recent NLP architectures have used transfer learning, pruning, quantization, and knowledge distillation to reach moderate model sizes while keeping performance close to that of their predecessors. Additionally, to mitigate the data size challenge raised by language models from a knowledge…
Taxonomy
Methods: Multi-Head Attention · Linear Layer · Knowledge Distillation · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Label Smoothing · Byte Pair Encoding · Dropout
