Which transformer architecture fits my data? A vocabulary bottleneck in self-attention
Noam Wies, Yoav Levine, Daniel Jannai, Amnon Shashua

TL;DR
This paper investigates how the size of the vocabulary and embedding rank create a bottleneck in Transformer architectures, influencing their optimal depth-to-width ratio across different data modalities.
Contribution
It introduces a theoretical framework that links vocabulary size and embedding rank to the variation in optimal Transformer architecture across data modalities, and demonstrates practical implications for model efficiency.
Findings
Existence of an embedding rank bottleneck limiting self-attention contribution
Link between vocabulary size, rank, and optimal depth-to-width ratio
Identification of 25-50% size redundancies in NLP models like ALBERT and T5
Abstract
After their successful debut in natural language processing, Transformer architectures are now becoming the de-facto standard in many domains. An obstacle for their deployment over new modalities is the architectural configuration: the optimal depth-to-width ratio has been shown to dramatically vary across data types (e.g., 10x larger over images than over language). We theoretically predict the existence of an embedding rank bottleneck that limits the contribution of self-attention width to the Transformer expressivity. We thus directly tie the input vocabulary size and rank to the optimal depth-to-width ratio, since a small vocabulary size or rank dictates an added advantage of depth over width. We empirically demonstrate the existence of this bottleneck and its implications on the depth-to-width interplay of Transformer architectures, linking the architecture variability across…
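The rank limit that the abstract refers to rests on an elementary linear-algebra fact: a sequence embedded through a table with V rows and width d forms a matrix of rank at most min(V, d), however long the sequence is. The sketch below illustrates only that fact and is not the paper's construction or experimental setup; the function names, the random integer embedding table, and the chosen sizes are all assumptions made for illustration.

```python
import random
from fractions import Fraction


def matrix_rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot row with a non-zero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, len(m)):
            if m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


def embedded_sequence_rank(vocab_size, width, seq_len, seed=0):
    """Rank of a random token sequence after an embedding-table lookup.

    Hypothetical helper: the table is random integers, not a trained embedding.
    """
    rng = random.Random(seed)
    embedding = [[rng.randint(-5, 5) for _ in range(width)]
                 for _ in range(vocab_size)]
    tokens = [rng.randrange(vocab_size) for _ in range(seq_len)]
    return matrix_rank([embedding[t] for t in tokens])


if __name__ == "__main__":
    # Small vocabulary: the rank cannot exceed vocab_size, whatever the width.
    print(embedded_sequence_rank(vocab_size=4, width=16, seq_len=32))
    # Large vocabulary: the rank cannot exceed the width.
    print(embedded_sequence_rank(vocab_size=64, width=16, seq_len=32))
```

With a vocabulary of 4 tokens, widening the embedding from 16 to any larger value cannot raise the rank above 4, which is the intuition behind a small vocabulary favoring depth over width.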
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Attention Dropout · Dropout · Inverse Square Root Schedule · Label Smoothing · Adafactor
