Sparse Auto-Encoders and Holism about Large Language Models
Jumbly Grindrod

TL;DR
This paper examines whether large language models support a holistic view of meaning or whether recent interpretability findings instead favor a decompositional one, and ultimately defends the holistic view.
Contribution
It introduces the role of sparse auto-encoder features in challenging holistic interpretations and argues that the holistic view remains valid if features are countable.
Findings
Discovery of interpretable latent features in LLMs.
Features suggest an alternative decompositional view of meaning.
Holistic interpretation remains plausible with countable features.
Abstract
Does Large Language Model (LLM) technology suggest a meta-semantic picture, i.e. a picture of how words and complex expressions come to have the meaning that they do? One modest approach explores the assumptions that seem to be built into how LLMs capture the meanings of linguistic expressions as a way of considering their plausibility (Grindrod, 2026a, 2026b). It has previously been argued that LLMs, in employing a form of distributional semantics, adopt a form of holism about meaning (Grindrod, 2023; Grindrod et al., forthcoming). However, recent work in mechanistic interpretability presents a challenge to these arguments. Specifically, the discovery of a vast array of interpretable latent features within the high-dimensional spaces used by LLMs potentially challenges the holistic interpretation. In this paper, I will present the original reasons for thinking that LLMs embody a form of…
