N-Grammer: Augmenting Transformers with latent n-grams
Aurko Roy, Rohan Anil, Guangda Lai, Benjamin Lee, Jeffrey Zhao, Shuyuan Zhang, Shibo Wang, Ye Zhang, Shen Wu, Rigel Swavely, Tao (Alex) Yu, Phuong Dao, Christopher Fifty, Zhifeng Chen, Yonghui Wu

TL;DR
N-Grammer enhances Transformer models by integrating latent n-grams, leading to improved performance in language modeling and text classification while aiming to reduce computational costs.
Contribution
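The paper augments the Transformer architecture with latent n-grams, an idea drawn from statistical language modeling: the n-grams are built from a discrete latent representation of the text sequence rather than from the raw tokens. The stated motivation is the high training and inference cost of large Transformer language models.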
Findings
Outperforms the Transformer and Primer baselines on language modeling, evaluated on the C4 dataset.
Outperforms the same baselines on text classification, evaluated on the SuperGLUE dataset.
The implementation is open-sourced in JAX for reproducibility.
Abstract
Transformer models have recently emerged as one of the foundational models in natural language processing, and as a byproduct, there is significant recent interest and investment in scaling these models. However, the training and inference costs of these large Transformer language models are prohibitive, thus necessitating more research in identifying more efficient variants. In this work, we propose a simple yet effective modification to the Transformer architecture inspired by the literature in statistical language modeling, by augmenting the model with n-grams that are constructed from a discrete latent representation of the text sequence. We evaluate our model, the N-Grammer, on language modeling on the C4 dataset as well as text classification on the SuperGLUE dataset, and find that it outperforms several strong baselines such as the Transformer and the Primer. We open-source our…
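The idea of building n-grams from a discrete latent representation can be sketched in JAX as below. This is a minimal illustration under stated assumptions, not the authors' open-sourced implementation: the function name latent_bigram_augment, the single codebook, the modulo hashing into a fixed-size table, and the concatenation step are all hypothetical choices made for the example. Each token embedding is mapped to its nearest codebook entry, consecutive cluster IDs are paired into bigram IDs, and the looked-up bigram embeddings are appended to the token embeddings.

```python
import jax
import jax.numpy as jnp


def latent_bigram_augment(token_emb, codebook, ngram_table):
    """Append latent-bigram embeddings to token embeddings (illustrative only).

    token_emb:   [seq_len, d_model] token (unigram) embeddings.
    codebook:    [num_clusters, d_model] cluster centers (hypothetical name).
    ngram_table: [ngram_vocab, d_ngram] bigram embeddings (hypothetical name).
    """
    num_clusters = codebook.shape[0]
    ngram_vocab = ngram_table.shape[0]

    # Discrete latent representation: nearest codebook entry per position.
    diffs = token_emb[:, None, :] - codebook[None, :, :]
    cluster_ids = jnp.argmin(jnp.sum(diffs ** 2, axis=-1), axis=-1)

    # Pair each cluster ID with its predecessor; position 0 is paired with ID 0.
    prev_ids = jnp.concatenate(
        [jnp.zeros((1,), cluster_ids.dtype), cluster_ids[:-1]]
    )
    bigram_ids = cluster_ids + num_clusters * prev_ids

    # Assumption: fold bigram IDs into the table with simple modulo hashing.
    bigram_emb = ngram_table[bigram_ids % ngram_vocab]

    # Concatenate unigram and latent-bigram features along the last axis.
    return jnp.concatenate([token_emb, bigram_emb], axis=-1), cluster_ids


# Usage with random inputs: 6 positions, d_model = 8, 4 clusters, d_ngram = 4.
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
token_emb = jax.random.normal(k1, (6, 8))
codebook = jax.random.normal(k2, (4, 8))
ngram_table = jax.random.normal(k3, (16, 4))
out, ids = latent_bigram_augment(token_emb, codebook, ngram_table)
print(out.shape)  # (6, 12)
```

With 4 clusters there are at most 16 distinct bigram IDs, so the 16-row table in this example has no hash collisions; a smaller table would make distinct bigrams share embeddings.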
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Depthwise Convolution · Softmax · Squared ReLU · Dense Connections · Absolute Position Encodings · Dropout · Byte Pair Encoding
