Which transformer architecture fits my data? A vocabulary bottleneck in self-attention

Noam Wies; Yoav Levine; Daniel Jannai; Amnon Shashua

arXiv:2105.03928 · cs.LG · June 10, 2021 · 6 cites

TL;DR

This paper investigates how vocabulary size and embedding rank create a bottleneck in Transformer architectures, and how that bottleneck shifts the optimal depth-to-width ratio across data modalities.

Contribution

It introduces a theoretical framework that ties vocabulary size and embedding rank to the variation in optimal Transformer architecture across domains, and it demonstrates practical implications for model efficiency.

Findings

01

Existence of an embedding rank bottleneck limiting self-attention contribution

02

Link between vocabulary size, rank, and optimal depth-to-width ratio

03

Identification of 25-50% size redundancies in NLP models like ALBERT and T5

Abstract

After their successful debut in natural language processing, Transformer architectures are now becoming the de-facto standard in many domains. An obstacle for their deployment over new modalities is the architectural configuration: the optimal depth-to-width ratio has been shown to dramatically vary across data types (e.g., $10\times$ larger over images than over language). We theoretically predict the existence of an embedding rank bottleneck that limits the contribution of self-attention width to the Transformer expressivity. We thus directly tie the input vocabulary size and rank to the optimal depth-to-width ratio, since a small vocabulary size or rank dictates an added advantage of depth over width. We empirically demonstrate the existence of this bottleneck and its implications on the depth-to-width interplay of Transformer architectures, linking the architecture variability across…
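
The rank limit behind the bottleneck can be illustrated with plain linear algebra. The sketch below is not the paper's analysis or code; it assumes a toy setup in which tokens are looked up in a V x r embedding table and then projected to the model width d by an r x d matrix. All names (matrix_rank, embedded_rank, vocab_size, embed_rank, width) are hypothetical and chosen for this illustration only. Under that assumption, the matrix of all possible first-layer inputs has rank at most min(V, r, d), so widening d beyond the vocabulary size or embedding rank adds no new independent directions at the input.

```python
import random
from fractions import Fraction


def matrix_rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, len(m)):
            if m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(n_rows, n_cols, rng):
    """Small random integer matrix (illustrative stand-in for learned weights)."""
    return [[rng.randint(-3, 3) for _ in range(n_cols)] for _ in range(n_rows)]


def embedded_rank(vocab_size, embed_rank, width, seed=0):
    """Rank of the V x d matrix holding every token's projected embedding."""
    rng = random.Random(seed)
    table = random_matrix(vocab_size, embed_rank, rng)  # assumed V x r lookup table
    proj = random_matrix(embed_rank, width, rng)        # assumed r x d projection
    return matrix_rank(matmul(table, proj))


if __name__ == "__main__":
    for v, r, d in [(6, 3, 16), (4, 8, 16), (12, 12, 5)]:
        print(f"V={v} r={r} d={d} rank={embedded_rank(v, r, d)} bound={min(v, r, d)}")
```

Running the script prints the measured rank next to the bound min(V, r, d) for a few configurations. Exact rational arithmetic is used so that the rank is not affected by floating-point tolerance.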

Videos

Which transformer architecture fits my data? A vocabulary bottleneck in self-attention · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Attention Dropout · Dropout · Inverse Square Root Schedule · Label Smoothing · Adafactor