The Counting Power of Transformers
Marco Sälzer, Chris Köcher, Alexander Kozachinskiy, Georg Zetzsche, Anthony Widjaja Lin

TL;DR
This paper introduces a formal framework for analyzing the counting power of transformers and shows that they can express highly nonlinear counting properties, not only linear ones. It also gives an undecidability result for transformer analysis and validates trainability experimentally.
Contribution
The paper provides a formal framework and proves that transformers can capture all semialgebraic counting properties, going beyond the (semi-)linear properties covered by earlier results, and explores the implications for analyzing transformer models.
Findings
Transformers can express all semialgebraic counting properties.
Undecidability results related to transformer analysis are established.
Experimental validation confirms trainability of nonlinear counting properties.
Abstract
Counting properties (e.g. determining whether certain tokens occur more than other tokens in a given input text) have played a significant role in the study of expressiveness of transformers. In this paper, we provide a formal framework for investigating the counting power of transformers. We argue that all existing results demonstrate transformers' expressivity only for (semi-)linear counting properties, i.e., which are expressible as a boolean combination of linear inequalities. Our main result is that transformers can express counting properties that are highly nonlinear. More precisely, we prove that transformers can capture all semialgebraic counting properties, i.e., expressible as a boolean combination of arbitrary multivariate polynomials (of any degree). Among others, these generalize the counting properties that can be captured by C-RASP softmax transformers, which capture…
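To make the distinction drawn in the abstract concrete, the following is a minimal Python sketch, not taken from the paper. It assumes a two-letter alphabet `a`, `b`, and the function names (`parikh_vector`, `more_a_than_b`, `a_squared_at_least_b`) are hypothetical. The first property is a linear inequality over token counts; the second is a degree-2 polynomial inequality and so serves as a simple example of a semialgebraic, nonlinear counting property.

```python
from collections import Counter


def parikh_vector(tokens, alphabet):
    """Return the occurrence count of each alphabet symbol, in alphabet order."""
    counts = Counter(tokens)
    return tuple(counts[symbol] for symbol in alphabet)


def more_a_than_b(tokens):
    """Linear counting property: #a - #b > 0."""
    a, b = parikh_vector(tokens, "ab")
    return a - b > 0


def a_squared_at_least_b(tokens):
    """Nonlinear counting property (illustrative choice): (#a)^2 - #b >= 0."""
    a, b = parikh_vector(tokens, "ab")
    return a * a - b >= 0


if __name__ == "__main__":
    text = list("abbab")  # two a's, three b's
    print(parikh_vector(text, "ab"))
    print(more_a_than_b(text))
    print(a_squared_at_least_b(text))
    # Both properties depend only on the counts, so reordering the input
    # does not change the answer.
    print(a_squared_at_least_b(sorted(text)) == a_squared_at_least_b(text))
```

Because both functions look only at the count vector, they are permutation-invariant by construction, which is the sense in which such properties are called counting properties.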
Peer Reviews
Decision: ICLR 2026 Poster
* Well-motivated and nicely written paper. While technically strong, its pitch and claims will also be accessible to a broader audience not deep into theoretical research.
* I liked the structure of the paper and how it (quickly) conveys the findings and uses many examples to explain concepts.
* That the studied transformer model uses simple (or no) position encodings and the standard softmax attention (either directly or through AHAT[U] which is a special case of softmax) is a positive, especia
* While it's valuable to have an **empirical validation**, I felt that section 6 is not as well described and discussed as the rest of the paper. E.g., even the metrics mentioned in the caption of Fig 1 need clarification.
* There are some places where it would be valuable to state which **design decisions / assumptions / choices** play an important role. E.g., it seems to me that Prop 4.1 (that whatever NoPE-AHAT can compute is expressible semi-algebraically) relies on the assumption of ReLU a
- Understanding the expressive power of Transformers is an important research topic. This paper presents new results on counting capabilities that were not clarified in previous work.
- The paper is concise and readable.
- The theoretical results are supported by experiments.
- Although I had some questions, the theoretical part appears mostly correct.
- It is unclear how practical it is to apply Transformers to nonlinear counting problems as discussed in this paper. For sequences such as text, which Transformers typically handle, input order is important, and tasks are generally not permutation-invariant. Therefore, studying permutation-invariant input properties may have limited practical relevance. As the authors discuss in the paragraph beginning at Line 263, combining counting properties with other characteristics is interesti
1. The perspective on semialgebraic sets rather than semilinear sets is novel and generalizes prior results on the counting power of transformers
2. The corollary on inexpressibility of Parity is interesting and accompanies a rich body of work tackling this question. Novel tools to me such as semialgebraic sets and Parikh images were introduced well enough for me to understand the technical parts of the paper
3. Introduction is well written and puts in perspective previous work on semilinear coun
1. Importantly, this paper disregards the impact of precision. In the finite-precision regime (which describes transformers used in practice), it is impossible to store counts from uniform attention for any input string. Recently, the expressive power of fixed-precision transformers (SMATs and AHATs) has already been characterized by a subclass of regular languages [Li and Cotterell, 2025] (and therefore cannot perform counts across all possible strings), undermining the relevance of the paper’
Taxonomy
Topics: Handwritten Text Recognition Techniques · Topic Modeling · Adversarial Robustness in Machine Learning
Methods: Softmax · Attention Is All You Need
