Bidirectional Attention as a Mixture of Continuous Word Experts
Kevin Christian Wibisono, Yixin Wang

TL;DR
This paper gives a statistical interpretation of bidirectional attention as a mixture-of-experts (MoE) model, connects it to the continuous bag of words (CBOW) model, and demonstrates its advantages in handling heterogeneous data and in out-of-distribution generalization.
Contribution
It introduces a novel statistical perspective of bidirectional attention as a mixture of experts, explaining its effectiveness and extending its application to categorical tabular data.
Findings
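Single-layer, single-head bidirectional attention is, upon reparameterization, equivalent to a continuous bag of words (CBOW) model with mixture-of-experts (MoE) weights.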
Extending this model to tabular data improves out-of-distribution generalization.
Stronger assumptions are needed for linear word analogies in attention-based models.
Abstract
Bidirectional attention, composed of self-attention with positional encodings and the masked language model (MLM) objective, has emerged as a key component of modern large language models (LLMs). Despite its empirical success, few studies have examined its statistical underpinnings: What statistical model is bidirectional attention implicitly fitting? What sets it apart from its non-attention predecessors? We explore these questions in this paper. The key observation is that fitting a single-layer single-head bidirectional attention, upon reparameterization, is equivalent to fitting a continuous bag of words (CBOW) model with mixture-of-experts (MoE) weights. Further, bidirectional attention with multiple heads and multiple layers is equivalent to stacked MoEs and a mixture of MoEs, respectively. This statistical viewpoint reveals the distinct use of…
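The CBOW-with-MoE-weights observation in the abstract can be illustrated with a small sketch. This is not the paper's reparameterization; it is a toy reading that assumes the only difference between the two models is how context vectors are averaged before the output softmax. The names `predict`, `gate_scores`, and `output_matrix`, and all numbers, are hypothetical. In actual attention the gating scores would come from query-key products that depend on the tokens and their positions; here they are supplied by hand.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_context(vectors, weights):
    """Weighted sum of context vectors (one weight per context position)."""
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]

def predict(context_vectors, output_matrix, gate_scores=None):
    """Return (mixture weights, distribution over the vocabulary).

    gate_scores=None  -> plain CBOW: every context word gets weight 1/n.
    gate_scores=[...] -> MoE-style weights: softmax of hypothetical scores,
                         standing in for attention scores (an assumption).
    """
    n = len(context_vectors)
    if gate_scores is None:
        weights = [1.0 / n] * n
    else:
        weights = softmax(gate_scores)
    h = weighted_context(context_vectors, weights)
    logits = [sum(u * x for u, x in zip(row, h)) for row in output_matrix]
    return weights, softmax(logits)

# Toy data: three context words in 2 dimensions, a vocabulary of four words.
context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, -1.0]]

cbow_w, cbow_p = predict(context, U)                                # uniform weights
flat_w, flat_p = predict(context, U, gate_scores=[0.0, 0.0, 0.0])   # equal scores
moe_w, moe_p = predict(context, U, gate_scores=[2.0, 0.0, -1.0])    # unequal scores
```

With equal gating scores the weighted model collapses to plain CBOW, so in this sketch CBOW is the special case where every context word is trusted equally. Unequal scores shift the prediction toward the context words with larger weights.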
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
