Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect
Joao Paulo Cavalcante Presa, Savio Salvarino Teles de Oliveira

TL;DR
This paper identifies and characterizes compositional literary primitives in instruction-tuned large language models, revealing how specific features influence affect, style, and self-representation across two architectures.
Contribution
It introduces a novel sparse autoencoder-based method to uncover emergent feature classes and demonstrates their role in emotion and style control in LLMs.
Findings
Llama reaches full 27/27 emotion coverage by combining naming-gates, multi-feature recipes, and single self-feature steering.
Gemma covers 23 out of 27 emotions, mainly through scene and imagery.
Affect naming and evocation are asymmetric across the two architectures.
Abstract
We characterize a compositional architecture of literary primitives in two instruction-tuned large language models (Llama 3.1 8B-Instruct and Gemma 2 9B-IT) via sparse autoencoders on mid-depth residual streams. Four feature classes emerge: naming-gates that promote lexical tokens of a target affect, an eleven-self cluster of first-person register features, stylistic register modulators (show-don't-tell and defamiliarization), and compositional emotions that arise only from multi-feature steering. Under a forced-choice 5-LLM judge panel applied to a 27-category emotion taxonomy (Cowen-Keltner), Llama reaches full 27/27 coverage by combining naming-gates, multi-feature recipes, and single self-feature steering; Gemma reaches 23/27 with adoration as the single residual strict-fail. Under random judging, the per-cell pass probability is on the order of and the expected number of…
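The steering the abstract describes, single-feature and multi-feature, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the common approach of adding scaled SAE decoder directions to the residual stream. The function name `steer_residual`, the decoder layout, the unit normalization, and the feature indices and strengths are all hypothetical, and random arrays stand in for real activations and SAE weights.

```python
import numpy as np


def steer_residual(resid, decoder, recipe):
    """Add a weighted sum of SAE decoder directions to every token position.

    resid:   (seq_len, d_model) residual-stream activations at one layer
    decoder: (n_features, d_model) SAE decoder matrix (assumed layout)
    recipe:  {feature_index: strength}; one entry is single-feature steering,
             several entries form a multi-feature recipe
    """
    delta = np.zeros(resid.shape[-1], dtype=resid.dtype)
    for idx, strength in recipe.items():
        direction = decoder[idx]
        # Unit-normalize so strengths are comparable across features
        # (an assumption of this sketch, not something the source states).
        direction = direction / np.linalg.norm(direction)
        delta += strength * direction
    # The same offset is broadcast over all token positions.
    return resid + delta


# Toy stand-ins for real activations and SAE weights.
rng = np.random.default_rng(0)
resid = rng.normal(size=(4, 16))
decoder = rng.normal(size=(32, 16))

single = steer_residual(resid, decoder, {3: 8.0})            # one feature
combined = steer_residual(resid, decoder, {3: 8.0, 17: 4.0})  # a recipe
```

In this sketch a compositional emotion would correspond to a recipe with several entries, while naming-gates and self-features would each be steered with a single entry.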
