Semantic Content Determines Algorithmic Performance
Martiño Ríos-García, Nawaf Alampara, Kevin Maik Jablonka

TL;DR
This paper introduces WhatCounts, a benchmark to test whether language models' counting ability is invariant to semantic content, revealing significant semantic sensitivity and argument-dependent behavior in large language models.
Contribution
The paper presents WhatCounts, a novel atomic benchmark for isolating semantic sensitivity in counting tasks, exposing how LLMs' performance varies with semantic content.
Findings
Frontier LLMs show over 40% accuracy variation depending solely on what is being counted (cities versus chemicals, names versus symbols).
Semantic sensitivity shifts unpredictably with small amounts of unrelated fine-tuning.
LLMs' behavior suggests they approximate algorithms with argument-dependent biases.
Abstract
Counting should not depend on what is being counted; more generally, any algorithm's behavior should be invariant to the semantic content of its arguments. We introduce WhatCounts to test this property in isolation. Unlike prior work that conflates semantic sensitivity with reasoning complexity or prompt variation, WhatCounts is atomic: count items in an unambiguous, delimited list with no duplicates, distractors, or reasoning steps for different semantic types. Frontier LLMs show over 40% accuracy variation depending solely on what is being counted - cities versus chemicals, names versus symbols. Controlled ablations rule out confounds. The gap is semantic, and it shifts unpredictably with small amounts of unrelated fine-tuning. LLMs do not implement algorithms; they approximate them, and the approximation is argument-dependent. As we show with an agentic example, this has implications…
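The setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' code or data: the item pools, the prompt wording, the ` | ` delimiter, and all function names (`make_prompt`, `accuracy_by_type`, `semantic_gap`) are assumptions made for illustration, and the gap is assumed here to be the difference between the best and worst per-type accuracy. The `model` argument is any callable that maps a prompt string to an integer answer.

```python
import random
from collections import defaultdict

# Hypothetical item pools, one per semantic type (not the benchmark's data).
POOLS = {
    "cities": ["Paris", "Lima", "Oslo", "Cairo", "Quito", "Hanoi", "Accra", "Perth"],
    "symbols": ["@", "#", "%", "&", "+", "=", "?", "!"],
}

DELIMITER = " | "  # assumed delimiter; it must not occur inside any item


def make_prompt(items):
    """Build an atomic counting prompt: one delimited list, no duplicates."""
    if len(set(items)) != len(items):
        raise ValueError("items must be unique")
    listing = DELIMITER.join(items)
    return f"How many items are in this list?\n{listing}\nAnswer with a number."


def accuracy_by_type(model, kinds, sizes, rng):
    """Exact-match accuracy of `model` for each semantic type."""
    hits = defaultdict(list)
    for kind in kinds:
        for n in sizes:
            items = rng.sample(POOLS[kind], n)  # sampling without replacement
            hits[kind].append(model(make_prompt(items)) == n)
    return {kind: sum(h) / len(h) for kind, h in hits.items()}


def semantic_gap(acc):
    """Spread between the best and worst semantic type (assumed definition)."""
    return max(acc.values()) - min(acc.values())


def exact_counter(prompt):
    """Reference 'model': a real algorithm that ignores what the items mean."""
    return len(prompt.splitlines()[1].split(DELIMITER))


rng = random.Random(0)
acc = accuracy_by_type(exact_counter, ["cities", "symbols"], [3, 5, 8], rng)
gap = semantic_gap(acc)
```

An exact counter scores the same on every semantic type, so its gap is zero by construction. Replacing `exact_counter` with a wrapper around an LLM call would, under these assumptions, yield the kind of per-type accuracy comparison the abstract reports.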
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Ferroelectric and Negative Capacitance Devices · Language and cultural evolution
