Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel

Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, Ruslan Salakhutdinov

arXiv:1908.11775 · cs.LG · November 13, 2019

TL;DR

This paper introduces a kernel-based formulation of Transformer's attention mechanism, providing new insights into its components, and proposes a novel attention variant that achieves competitive results with reduced computation.

Contribution

It offers a unified kernel perspective on attention, enabling new variants and better understanding of positional embedding integration.

Findings

01 Kernel formulation clarifies attention components

02 Proposed kernel-based attention variant reduces computation

03 Achieves competitive performance on translation and sequence prediction tasks

Abstract

Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the attention mechanism, which concurrently processes all inputs in the streams. In this paper, we present a new formulation of attention via the lens of the kernel. To be more precise, we realize that attention can be seen as applying a kernel smoother over the inputs, with the kernel scores being the similarities between inputs. This new formulation gives us a better way to understand individual components of the Transformer's attention, such as a better way to integrate the positional embedding. Another important advantage of our kernel-based formulation is that it paves the way to a larger space of composing Transformer's attention. As an example, we…
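To make the kernel-smoother reading of attention concrete, here is a minimal sketch in plain Python. It is not taken from the paper's repository: the function names (`exp_kernel`, `kernel_smoother`, `softmax_attention`) and the toy vectors are illustrative assumptions, and the learned query/key/value projections are omitted for brevity. Assuming an exponential kernel on the scaled dot product, the kernel-weighted average of the values should coincide with standard softmax attention, which the final check compares numerically.

```python
import math


def exp_kernel(q, k):
    """Exponential kernel on the scaled dot product: exp(<q, k> / sqrt(d)).

    Hypothetical helper; assumes q and k are equal-length lists of floats.
    """
    d = len(q)
    return math.exp(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))


def kernel_smoother(query, keys, values, kernel=exp_kernel):
    """Kernel-weighted average of `values`.

    Each value is weighted by the kernel score between the query and the
    matching key, normalized so the weights sum to one.
    """
    scores = [kernel(query, k) for k in keys]
    total = sum(scores)
    dim = len(values[0])
    return [sum(s * v[i] for s, v in zip(scores, values)) / total
            for i in range(dim)]


def softmax_attention(query, keys, values):
    """Scaled dot-product attention for a single query (no projections)."""
    d = len(query)
    logits = [sum(a * b for a, b in zip(query, k)) / math.sqrt(d) for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[i] for e, v in zip(exps, values)) / total
            for i in range(dim)]


# Toy inputs (made up for illustration).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

smoothed = kernel_smoother(query, keys, values)
attended = softmax_attention(query, keys, values)

# Under the exponential kernel the two computations should agree.
matches = all(math.isclose(a, b, rel_tol=1e-9)
              for a, b in zip(smoothed, attended))
print(matches)
```

Because `kernel` is a parameter, swapping in another non-negative similarity function gives a different attention variant, which is one way to picture the "larger space of composing Transformer's attention" the abstract mentions.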

Code & Models

Repositories

yaohungt/TransformerDissection (Official)

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax