Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers

Freya Behrens; Luca Biggio; Lenka Zdeborová

arXiv:2407.11542·cs.LG·November 13, 2025

TL;DR

This paper investigates how small transformer architectures implement counting tasks, revealing two distinct strategies and how design choices influence their performance and robustness.

Contribution

It identifies two theoretical counting strategies in transformers and analyzes how architecture and design choices affect their solution mechanisms.

Findings

01 Two counting strategies: relation-based and inventory-based.

02 Design choices such as softmax and beginning-of-sequence tokens improve robustness when embedding dimensions are comparatively small.

03 Empirical evidence confirms the theoretical learning regimes.

Abstract

Next to scaling considerations, architectural design choices profoundly shape the solution space of transformers. In this work, we analyze the solutions simple transformer blocks implement when tackling the histogram task: counting items in sequences. Despite its simplicity, this task reveals a complex interplay between predictive performance, vocabulary and embedding sizes, token-mixing mechanisms, and feed-forward layer capacity. We identify two theoretical counting strategies transformers adopt, relation-based and inventory-based counting, each defining distinct learning regimes for the task. These strategies dictate how functionality is distributed between attention and feed-forward layers. We further show that adding softmax and beginning-of-sequence tokens allows for more robustness when embedding dimensions are comparatively small. Empirical introspection of trained models closely…
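To make the histogram task concrete, here is a minimal Python sketch of the input/target relationship, assuming the task asks, for every position in a sequence, how often the token at that position occurs in the whole sequence. The function names `histogram_targets` and `pairwise_counts` are hypothetical and do not come from the paper or its repository. The second function computes the same targets by comparing every token with every other token; this is only a loose analogy to how a token-mixing layer could relate tokens, not the construction analyzed by the authors.

```python
from collections import Counter


def histogram_targets(seq):
    """For each position, return how often that position's token occurs in seq.

    Assumption: this is the target of the histogram task as described in the
    abstract ("counting items in sequences").
    """
    counts = Counter(seq)  # one pass to tally every distinct token
    return [counts[tok] for tok in seq]


def pairwise_counts(seq):
    """Same targets, computed by comparing each token with all tokens in seq.

    Illustrative only: a quadratic comparison of token pairs, not the
    mechanism described in the paper.
    """
    return [sum(1 for other in seq if other == tok) for tok in seq]


if __name__ == "__main__":
    example = ["a", "b", "a", "c", "a"]
    print(histogram_targets(example))  # [3, 1, 3, 1, 3]
    print(pairwise_counts(example))    # [3, 1, 3, 1, 3]
```

Both functions return one count per input position, so the output always has the same length as the input sequence.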

Code & Models

Repositories

SPOC-group/counting-attention
PyTorch · Official

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Softmax