Understanding Counting Mechanisms in Large Language and Vision-Language Models

Hosein Hasani; Amirmohammad Izadi; Fatemeh Askari; Mobin Bagherian; Sadegh Mohammadian; Mohammad Izadi; Mahdieh Soleymani Baghshah

arXiv:2511.17699 · cs.CV · April 21, 2026

TL;DR

This paper investigates how large language and vision-language models represent and process counting. It reveals a layered internal mechanism in which count information is carried by individual tokens and visual features and is shaped by structural cues such as separators.

Contribution

The paper introduces CountScope, a tool for the mechanistic interpretability of numerical content, and uncovers the layerwise emergence of numerical representations and internal counters in LLMs and LVLMs.

Findings

01. Tokens and visual features encode count information.

02. Numerical representations emerge progressively across layers.

03. Models use structural cues like separators as counting shortcuts.

Abstract

Counting is one of the fundamental abilities of large language models (LLMs) and large vision-language models (LVLMs). This paper examines how these foundation models represent and compute numerical information in counting tasks. We use controlled experiments with repeated textual and visual items and analyze counting in LLMs and LVLMs through a set of behavioral, observational, and causal mediation analyses. To this end, we design a specialized tool, CountScope, for the mechanistic interpretability of numerical content. Results show that individual tokens or visual features encode latent positional count information that can be extracted and transferred across contexts. Layerwise analyses reveal a progressive emergence of numerical representations, with lower layers encoding small counts and higher layers representing larger ones. We identify an internal counter mechanism that updates…
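
To make the idea of extracting a count-bearing state and transferring it across contexts more concrete, here is a minimal toy sketch of a patching-style intervention. It is not CountScope and does not reproduce the paper's experiments: the "model" is a hypothetical running counter in plain Python, and the names `SEPARATOR`, `running_states`, and `readout` are assumptions introduced only for illustration.

```python
from typing import List, Optional

SEPARATOR = ","  # hypothetical separator token used only in this toy


def running_states(tokens: List[str]) -> List[int]:
    """Toy stand-in for per-token hidden states: the running item count."""
    states, count = [], 0
    for tok in tokens:
        if tok != SEPARATOR:
            count += 1
        states.append(count)
    return states


def readout(
    tokens: List[str],
    patch_at: Optional[int] = None,
    patch_value: Optional[int] = None,
) -> int:
    """Run the toy counter and return its final count.

    If patch_at is given, the state at that position is overwritten with
    patch_value, and all later updates continue from the patched state.
    """
    count = 0
    for i, tok in enumerate(tokens):
        if tok != SEPARATOR:
            count += 1
        if patch_at is not None and i == patch_at:
            count = patch_value
    return count


# Source context: five repeated items. Target context: three repeated items.
source = ["apple", ",", "apple", ",", "apple", ",", "apple", ",", "apple"]
target = ["apple", ",", "apple", ",", "apple"]

# Extract the state at the last source token, then patch it into the
# first position of the target context.
donor_state = running_states(source)[-1]
clean = readout(target)
patched = readout(target, patch_at=0, patch_value=donor_state)

print(clean, patched)
```

In this toy, the clean target run reports 3, while the patched run reports 7: the donor state of 5 replaces the first item's state, and the two remaining target items are counted on top of it. A shift of this kind in the output after an intervention is the sort of evidence a causal mediation analysis looks for, though in a real model the state is a high-dimensional activation rather than an integer.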
