Sparse Auto-Encoders and Holism about Large Language Models

Jumbly Grindrod

arXiv:2603.26207 · cs.CL · March 30, 2026

TL;DR

This paper examines whether large language models support a holistic view of meaning or whether recent interpretability findings instead point to a decompositional one, and ultimately defends the holistic view.

Contribution

It sets out how sparse auto-encoder features challenge the holistic interpretation of LLMs and argues that the holistic view remains valid provided the features are countable.

Findings

01. Discovery of interpretable latent features in LLMs.

02. Features suggest an alternative decompositional view of meaning.

03. Holistic interpretation remains plausible with countable features.

Abstract

Does Large Language Model (LLM) technology suggest a meta-semantic picture, i.e., a picture of how words and complex expressions come to have the meanings that they do? One modest approach explores the assumptions that seem to be built into how LLMs capture the meanings of linguistic expressions, as a way of considering their plausibility (Grindrod, 2026a, 2026b). It has previously been argued that LLMs, in employing a form of distributional semantics, adopt a form of holism about meaning (Grindrod, 2023; Grindrod et al., forthcoming). However, recent work in mechanistic interpretability presents a challenge to these arguments. Specifically, the discovery of a vast array of interpretable latent features within the high-dimensional spaces used by LLMs potentially challenges the holistic interpretation. In this paper, I will present the original reasons for thinking that LLMs embody a form of…
