When Can Transformers Count to n?

Gilad Yehudai; Haim Kaplan; Guy Dar; Royi Rassin; Asma Ghandeharioun; Mor Geva; Amir Globerson

arXiv:2407.15160 · cs.CL · February 26, 2026 · 1 cites

TL;DR

This paper investigates the fundamental limits of transformer models on counting tasks. It identifies a phase transition governed by the relationship between embedding dimension and vocabulary size, which determines whether a transformer can count tokens accurately.

Contribution

It provides a theoretical analysis of counting limitations in transformers, identifying a critical threshold where counting becomes unstable and unlearnable, supported by empirical validation.

Findings

1. Transformers can count accurately when the embedding dimension is at least the vocabulary size.
2. Counting becomes unstable and unlearnable when the vocabulary size exceeds the embedding dimension.
3. Pretrained models also exhibit counting failures consistent with the theoretical threshold.
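
The first finding can be illustrated with a minimal sketch. It is not the authors' construction: it assumes one-hot (hence orthogonal) embeddings, replaces attention with a plain average over the sequence, and assumes the readout is given the sequence length n. The names `one_hot` and `count_via_mean` are hypothetical.

```python
def one_hot(index, dim):
    """Return the index-th standard basis vector of length dim."""
    return [1.0 if j == index else 0.0 for j in range(dim)]


def count_via_mean(tokens, query, embed):
    """Estimate how often `query` occurs in `tokens`.

    Averages the token embeddings (a stand-in for uniform attention),
    projects the average onto the query embedding, and rescales by the
    sequence length n, which this sketch assumes is known.
    """
    n = len(tokens)
    dim = len(embed[0])
    mean = [sum(embed[t][j] for t in tokens) / n for j in range(dim)]
    score = sum(mean[j] * embed[query][j] for j in range(dim))
    return n * score


vocab_size = 4
dim = 4  # assumption: embedding dimension equals vocabulary size
embed = [one_hot(i, dim) for i in range(vocab_size)]
tokens = [0, 1, 0, 2, 0, 3, 1]

# Rounding absorbs floating-point error from dividing and re-multiplying by n.
counts = [round(count_via_mean(tokens, q, embed)) for q in range(vocab_size)]
print(counts)  # [3, 2, 1, 1]
```

Because the embeddings are orthogonal, each coordinate of the average holds one token's relative frequency, so no token's count leaks into another's.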

Abstract

Large language models based on the transformer architecture can solve highly complex tasks, yet their fundamental limitations on simple algorithmic problems remain poorly understood. In this work, we focus on basic counting tasks and investigate how the difficulty of these tasks scales with the transformer embedding dimension, the context length, and the vocabulary size. We reveal a sharp theoretical phase transition governed by the relationship between the embedding dimension and the vocabulary size. When the dimension is at least as large as the vocabulary, transformers can perfectly maintain token counts. However, when the vocabulary exceeds the embedding dimension, the interference between non-orthogonal token representations forces the network weights to scale polynomially. This renders the exact counting algorithm numerically unstable and practically unlearnable. We empirically…
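
The interference described in the abstract can be seen in a toy case. The sketch below is an illustration under stated assumptions, not the paper's proof: it embeds 4 tokens in 2 dimensions and uses a plain average as a stand-in for uniform attention. The embedding values and the name `mean_embedding` are hypothetical.

```python
def mean_embedding(tokens, embed):
    """Average the embeddings of `tokens` (a stand-in for uniform attention)."""
    n = len(tokens)
    dim = len(embed[0])
    return [sum(embed[t][j] for t in tokens) / n for j in range(dim)]


# Assumption: 4 tokens in 2 dimensions, so the vectors cannot all be orthogonal.
embed = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

seq_a = [0, 2]  # token 0 occurs once
seq_b = [1, 3]  # token 0 never occurs

mean_a = mean_embedding(seq_a, embed)
mean_b = mean_embedding(seq_b, embed)

print(mean_a == mean_b)                # True: both averages are the zero vector
print(seq_a.count(0), seq_b.count(0))  # 1 0
```

Two sequences of the same length with different counts of token 0 collapse to the same averaged vector, so no readout applied to that average alone can return the correct count for both. This only rules out the simple averaging scheme; it says nothing about what non-uniform attention or larger weights can achieve.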

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

This paper studies an important question on the counting ability of Transformers. An interesting construction is proposed to address the query counting problem using Transformer architecture.

Weaknesses

Although some theoretical discussion is provided for the proposed construction, the construction itself is only a toy model and may be too simple to reflect the ability of realistic Transformers. Also, the fact that this particular construction cannot achieve certain tasks does not imply that no other construction can. In addition, the proofs have too many loose ends (see Questions below).

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The presented theory is very clear. The authors explain their theoretical contribution with very intuitive arguments.
2. The theoretical argument that vocabulary size and context length jointly block the learning of counting is well supported by empirical experiments.

Weaknesses

1. The work is mostly constructive, so it remains unclear whether Transformers will converge to either of the solutions. A mechanistic investigation, as mentioned in the conclusion, would be a great supplement to the paper.
2. The width bottleneck in the second construction seems to hold only for a 1-layer MLP.
3. Technically, the argument that position encoding is necessary holds only for encoder-based models or 1-layer causal models, a point that should be made clear in the paper.

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. This work focuses on the counting task for language models. The authors provide both theoretical and empirical results to demonstrate the limitations of LLMs when the dimension is small.
2. The paper is well presented and easy for readers to follow.

Weaknesses

1. **Lack of Generality:** While the paper focuses on the counting task, its impact on real-world applications is unclear. The conclusions are specific to counting and may not generalize well to broader contexts.
2. The study primarily analyzes one-layer transformers, leaving the capabilities of multi-layer transformers unexplored. Further **theoretical** investigation is needed to understand how additional layers might influence performance on counting tasks.

Taxonomy

Topics: Topic Modeling · Language and cultural evolution · Natural Language Processing Techniques