Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs
Sohan Venkatesh

TL;DR
Large language models accurately encode token counts internally, but a format-specific MLP block overwrites this information, causing counting failures despite correct internal representations.
Contribution
This paper demonstrates that counting failures are due to a format-triggered MLP overwriting correct counts, not a lack of internal representation.
Findings
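Linear probes on the residual stream decode the correct count at every post-embedding layer.
A format-triggered MLP block overwrites the correctly encoded count for repeated word-tokens in space-separated list format.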
Counting failures are due to routing issues, not representation.
Abstract
Large language models fail at counting repeated tokens despite strong performance on broader reasoning benchmarks. These failures are commonly attributed to limitations in internal count tracking. We show this attribution is wrong. Linear probes on the residual stream decode the correct count with near-perfect accuracy at every post-embedding layer, across all model depths. This holds even at the exact layers where the wrong answer crystallizes while the model simultaneously outputs an incorrect count. Attention patterns show no evidence of collapse over repeated tokens and tokenization artifacts account for none of the failure. Instead, a format-triggered multi-layer perceptron (MLP) block overwrites the correctly-encoded count with a fixed wrong answer at roughly 88–93% network depth. This prior fires for repeated word-tokens in space-separated list format and is absent for repeated…
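The linear-probe idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's code: the activations are synthetic stand-ins for residual-stream vectors, generated under the assumption that the count is encoded along a single random direction plus Gaussian noise. The function names (`make_synthetic_residuals`, `fit_linear_probe`, `probe_predict`) and all sizes are hypothetical. The probe itself is an ordinary least-squares regression from activation to count, with predictions rounded to the nearest integer.

```python
import numpy as np


def make_synthetic_residuals(n_samples, d_model, max_count, rng, noise=0.1):
    """Hypothetical stand-in for residual-stream activations.

    Assumption (not from the paper): the count is written linearly along
    one random unit direction, with isotropic Gaussian noise added.
    """
    counts = rng.integers(1, max_count + 1, size=n_samples)
    direction = rng.normal(size=d_model)
    direction /= np.linalg.norm(direction)
    acts = counts[:, None] * direction[None, :]
    acts = acts + noise * rng.normal(size=(n_samples, d_model))
    return acts, counts


def fit_linear_probe(acts, counts):
    """Fit a least-squares linear probe (weights plus bias) from activations to counts."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(X, counts.astype(float), rcond=None)
    return w


def probe_predict(w, acts):
    """Apply the probe and round to the nearest integer count."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])
    return np.rint(X @ w).astype(int)


rng = np.random.default_rng(0)
acts, counts = make_synthetic_residuals(n_samples=2000, d_model=64, max_count=10, rng=rng)

# Hold out the last 500 samples to measure how well the probe generalizes.
train, test = slice(0, 1500), slice(1500, 2000)
w = fit_linear_probe(acts[train], counts[train])
accuracy = float(np.mean(probe_predict(w, acts[test]) == counts[test]))
print(f"held-out probe accuracy: {accuracy:.3f}")
```

In a real experiment the synthetic activations would be replaced by residual-stream vectors cached from the model at one layer, with one probe fitted per layer. High held-out accuracy at a layer where the model's output is wrong is the kind of dissociation the abstract describes.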
