# A Multiscale Visualization of Attention in the Transformer Model

**Authors:** Jesse Vig

arXiv: 1906.05714 · 2019-06-14

## TL;DR

This paper introduces an open-source tool that visualizes attention at multiple scales in Transformer models such as BERT and GPT-2, aiding interpretability by showing how the model distributes attention across layers and heads.

## Contribution

The paper presents a novel visualization tool that makes multi-layer, multi-head attention mechanisms in Transformers more interpretable and accessible.

## Key findings

- Supports detecting model bias (presented as an example use case)
- Helps locate relevant attention heads
- Helps link neurons to model behavior
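
As a rough illustration of what "locating relevant attention heads" can mean in practice, the sketch below ranks heads by a simple numeric score. This is not the paper's method, which relies on visual inspection. The scoring criterion (mean attention paid to the immediately preceding token), the function names `previous_token_score` and `rank_heads`, and the toy attention values are all hypothetical and chosen only for illustration.

```python
def previous_token_score(matrix):
    """Mean attention that each token (from position 1 on) pays to the token before it."""
    n = len(matrix)
    if n < 2:
        return 0.0
    return sum(matrix[i][i - 1] for i in range(1, n)) / (n - 1)


def rank_heads(attention, score_fn):
    """Rank heads by score, highest first.

    attention[layer][head] is assumed to be a seq_len x seq_len matrix
    whose row i holds the attention weights of token i (each row sums to 1).
    Returns a list of (score, layer_index, head_index) tuples.
    """
    scored = [
        (score_fn(matrix), layer, head)
        for layer, heads in enumerate(attention)
        for head, matrix in enumerate(heads)
    ]
    return sorted(scored, reverse=True)


# Toy data: 1 layer, 2 heads, 3 tokens (made-up values, not from BERT or GPT-2).
uniform_head = [[1 / 3] * 3 for _ in range(3)]
previous_token_head = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
]
attention = [[uniform_head, previous_token_head]]

ranking = rank_heads(attention, previous_token_score)
for score, layer, head in ranking:
    print(f"layer {layer}, head {head}: score {score:.3f}")
```

Swapping in a different `score_fn` would rank heads by a different pattern; a visualization remains useful for checking that a high-scoring head actually behaves as the score suggests.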

## Abstract

The Transformer is a sequence model that forgoes traditional recurrent architectures in favor of a fully attention-based approach. Besides improving performance, an advantage of using attention is that it can also help to interpret a model by showing how the model assigns weight to different input elements. However, the multi-layer, multi-head attention mechanism in the Transformer model can be difficult to decipher. To make the model more accessible, we introduce an open-source tool that visualizes attention at multiple scales, each of which provides a unique perspective on the attention mechanism. We demonstrate the tool on BERT and OpenAI GPT-2 and present three example use cases: detecting model bias, locating relevant attention heads, and linking neurons to model behavior.
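
To make "how the model assigns weight to different input elements" concrete, here is a minimal sketch of the scaled dot-product attention weights for a single head, the kind of matrix a tool like this would visualize. It assumes the standard Transformer formulation, softmax(QK^T / sqrt(d_k)); the function names and the toy query and key vectors are illustrative assumptions, not values taken from BERT or GPT-2.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Return a seq_len x seq_len matrix; row i is how token i spreads its attention."""
    scale = math.sqrt(len(keys[0]))  # sqrt(d_k), the key dimension
    weights = []
    for q in queries:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights.append(softmax(scores))
    return weights


# Toy query/key vectors for three tokens (made-up values for illustration).
tokens = ["the", "cat", "sat"]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights = attention_weights(queries, keys)
for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

A full model produces one such matrix per head in every layer, which is why the multi-layer, multi-head mechanism quickly becomes hard to read without a visualization.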

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1906.05714/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/1906.05714/full.md

## References

21 references — full list in the complete paper: https://tomesphere.com/paper/1906.05714/full.md

---
Source: https://tomesphere.com/paper/1906.05714