Sequential Enumeration in Large Language Models
Kuinan Hou, Marco Zorzi, Alberto Testolin

TL;DR
This paper investigates whether large language models can systematically enumerate and count items in a sequence. It finds that they count only when explicitly prompted and do not do so spontaneously, which points to a gap between LLMs and symbolic systems.
Contribution
The study evaluates five state-of-the-art LLMs (proprietary, open-source, and reasoning models) on counting and enumeration tasks, highlighting their limitations and the role of prompting.
Findings
Some LLMs can deploy counting with explicit prompts
LLMs do not spontaneously count in simple enumeration tasks
Counting abilities do not scale predictably with model size
Abstract
Reliably counting and generating sequences of items remain a significant challenge for neural networks, including Large Language Models (LLMs). Indeed, although this capability is readily handled by rule-based symbolic systems based on serial computation, learning to systematically deploy counting procedures is difficult for neural models, which should acquire these skills through learning. Previous research has demonstrated that recurrent architectures can only approximately track and enumerate sequences of events, and it remains unclear whether modern deep learning systems, including LLMs, can deploy systematic counting procedures over sequences of discrete symbols. This paper aims to fill this gap by investigating the sequential enumeration abilities of five state-of-the-art LLMs, including proprietary, open-source, and reasoning models. We probe LLMs in sequential naming and…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
This case study on counting demonstrates the importance of prompt instructions and that LLMs can be sensitive to instructions about how they should compute things internally. It also provides some insights into how LLMs keep track of counts of words.
The contributions of this paper ultimately seem somewhat narrow, as they are restricted to counting-based tasks. It is unclear why it is important to specifically understand LLM performance on these tasks. In addition, while the analysis of the model internals is interesting, it seems somewhat shallow and it's not clear what to take away. A strong conference paper should at least either have deep insights (even about a narrow task) or analyze a diverse set of important tasks (even if somewhat sh
- Clarity: The paper is well-written, with a clear and logical structure.
- Significance: The paper systematically demonstrates a critical failure mode in modern LLMs: the inability to spontaneously deploy a basic, procedural algorithm like counting.
- Familiar findings: The paper's main conclusion, that LLMs struggle with counting, confirms a known limitation, which may obscure the work's core novelty.
- The introduction does not show the paper's most compelling result. The paper's key intellectual contribution is not apparent until late in the results section, weakening the narrative and its initial impact.
- Limited Scope of Mechanistic Analysis: PCA and neuron analysis are performed only on one model, Llama-70B. While this provides a fasci
- Experiments are well-designed and consider many aspects of sequential enumeration, with clear descriptions of tasks that cover many situations in sequential enumeration.
- Diverse types of prompts (Explicit, Spontaneous, Mental, Forbid) were implemented and studied.
- The approach presented in Section 3.5, the results in Section 4.2, and Figures 3 and 4 provide an interesting analysis of the hidden states of the LLMs during enumeration.
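The kind of hidden-state analysis mentioned in the last point can be illustrated with a minimal sketch. It does not reproduce the paper's method: the activations below are synthetic stand-ins (a single hypothetical "count direction" plus noise), and `pca_project` is an illustrative helper written with NumPy's SVD. It only shows how PCA could reveal whether the leading component tracks position in an enumerated sequence.

```python
import numpy as np


def pca_project(hidden_states: np.ndarray, n_components: int = 2):
    """Project rows of `hidden_states` onto their top principal components.

    Returns the projected points and the fraction of variance explained
    by each retained component.
    """
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    # SVD of the centered data: rows of `vt` are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:n_components]


# Synthetic stand-in for per-position hidden states (NOT real model activations).
# Assumption: the running count is encoded linearly along one direction.
rng = np.random.default_rng(0)
n_positions, hidden_dim = 20, 64
count_direction = rng.normal(size=hidden_dim)
count_direction /= np.linalg.norm(count_direction)
positions = np.arange(1, n_positions + 1)
noise = 0.05 * rng.normal(size=(n_positions, hidden_dim))
hidden_states = np.outer(positions, count_direction) + noise

projected, explained = pca_project(hidden_states)
# Correlation between the first principal component and the item position.
# The sign of a principal component is arbitrary, so use the absolute value.
corr = abs(np.corrcoef(projected[:, 0], positions)[0, 1])
print(f"PC1 explained variance: {explained[0]:.3f}, |corr with position|: {corr:.3f}")
```

With real activations, `hidden_states` would be replaced by one vector per generated item taken from a chosen layer; which layer and token position to use is a design choice this sketch leaves open.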
Although the paper mainly evaluates the abilities of LLMs, it would be improved by discussing a strategy for improving their sequential enumeration.
The paper poses a **well-posed question** and uses carefully designed **descriptive** analyses to examine whether current LLMs can count.
- tests multiple models, task types, and prompting setups.
- uses PCA in the Llama model to reveal internal strategies for counting.
- ensures one-token-per-word to isolate counting from tokenization issues
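The one-token-per-word constraint in the last point can be sketched as a simple filter. This is an illustration under stated assumptions, not the authors' pipeline: `toy_tokenize` and `TOY_VOCAB` are hypothetical stand-ins for a real model tokenizer, and the 5-letter length follows the word-list description given elsewhere in the reviews.

```python
from typing import Callable, Iterable, List, Set


def toy_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match tokenizer over a toy vocabulary.

    Hypothetical stand-in for a real subword tokenizer; characters not
    covered by the vocabulary fall back to single-character tokens.
    """
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


def single_token_words(
    candidates: Iterable[str],
    tokenize: Callable[[str], List[str]],
    length: int = 5,
) -> List[str]:
    """Keep words of the given length that map to exactly one token."""
    return [w for w in candidates if len(w) == length and len(tokenize(w)) == 1]


# Toy vocabulary and candidate list, invented for illustration only.
TOY_VOCAB = {"apple", "table", "gr", "ape", "house", "s"}
candidates = ["apple", "grape", "table", "houses", "zebra"]

selected = single_token_words(candidates, lambda w: toy_tokenize(w, TOY_VOCAB))
print(selected)
```

In practice the `tokenize` argument would wrap the tokenizer of the model under test, since the same word can be a single token for one model and several tokens for another.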
Besides the strong descriptive analysis showing the limitations of LLMs, the study does not go deeper to investigate where the counting ability comes from. Different model architectures may encode numerical skills through distinct mechanisms, and performance may also reflect biases learned from training data frequency rather than systematic counting ability. Without separating these factors, the paper cannot fully explain how or why counting emerges in current models.
- some experimental desig
- The paper studies a known yet interesting phenomenon of the counting ability of LLMs.
- Looking beyond accuracy to hidden-state patterns is a useful angle.
- Novelty concern: I don’t think the paper proposes anything new to the literature. The results (explicit > mental > spontaneous > forbid) feel expected. If the main takeaway is “use explicit counting prompts,” the contribution is limited without deeper analysis of why or any causal evidence.
- Word list transparency: The datasets are synthetic and rely on 5-letter, one-token words, but the exact list and counts are not provided. Please report: (i) how many candidate words were considered; (ii)
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Language and cultural evolution
