Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task

Siavash Golkar; Alberto Bietti; Mariel Pettee; Michael Eickenberg; Miles Cranmer; Keiya Hirashima; Geraud Krawezik; Nicholas Lourie; Michael McCabe; Rudy Morel; Ruben Ohana; Liam Holden Parker; Bruno Régaldo-Saint Blancard; Kyunghyun Cho; Shirley Ho

arXiv:2406.02585·cs.LG·June 6, 2024

Open Access

TL;DR

This paper introduces the contextual counting task, a toy problem for analyzing Transformer behavior on quantitative tasks. It finds that causal attention is much better suited to the task than non-causal attention, and that using no positional embeddings gives the best accuracy.

Contribution

The paper provides a theoretical and empirical study of Transformers on a new quantitative task, examining how attention type (causal versus non-causal) and positional encoding affect performance and interpretability.

Findings

01 Causal attention outperforms non-causal attention on the task.

02 Using no positional embeddings yields the best accuracy.

03 Rotary embeddings are competitive and easier to train.
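
The difference between causal and non-causal attention in these findings can be made concrete with a small sketch. It is an illustration only, not the paper's model or its contextual counting setup: the function `uniform_attention`, the toy `is_bos` vector, and the assumption of equal attention scores are all hypothetical. It shows one standard intuition for why a causal model can work without positional embeddings: when every position attends equally to its prefix, the share of attention landing on a beginning-of-sequence marker is 1/(i+1) at position i, so position can be read off. With non-causal attention the same quantity is identical at every position.

```python
import numpy as np

def uniform_attention(values, causal):
    """Average `values` using equal attention weights over the allowed positions."""
    n = len(values)
    # mask[i, j] is True where position i is allowed to attend to position j
    if causal:
        mask = np.tril(np.ones((n, n), dtype=bool))  # only positions j <= i
    else:
        mask = np.ones((n, n), dtype=bool)           # every position
    # Equal scores under softmax reduce to a plain average over allowed positions
    weights = mask / mask.sum(axis=1, keepdims=True)
    return weights @ values

# Hypothetical toy input: 1.0 marks a beginning-of-sequence token, 0.0 elsewhere
is_bos = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])

causal_out = uniform_attention(is_bos, causal=True)   # varies with position
full_out = uniform_attention(is_bos, causal=False)    # constant across positions

print(causal_out)
print(full_out)
```

In the causal case the output differs at every position, so later layers could in principle recover where each token sits. In the non-causal case the output carries no positional signal.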

Abstract

Transformers have revolutionized machine learning across diverse domains, yet understanding their behavior remains crucial, particularly in high-stakes applications. This paper introduces the contextual counting task, a novel toy problem aimed at enhancing our understanding of Transformers in quantitative and scientific contexts. This task requires precise localization and computation within datasets, akin to object detection or region-based scientific analysis. We present theoretical and empirical analysis using both causal and non-causal Transformer architectures, investigating the influence of various positional encodings on performance and interpretability. In particular, we find that causal attention is much better suited for the task, and that no positional embeddings lead to the best accuracy, though rotary embeddings are competitive and easier to train. We also show that out of…

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention