When the Coffee Feature Activates on Coffins: An Analysis of Feature Extraction and Steering for Mechanistic Interpretability
Raphael Ronge, Markus Maier, Frederick Eberhardt

TL;DR
This paper critically evaluates the effectiveness of sparse autoencoders in extracting human-interpretable features from language models, revealing significant fragility and limitations that challenge their reliability for AI safety applications.
Contribution
It replicates Anthropic's feature extraction and steering results using open-source SAEs for Llama 3.1, and highlights the fragility and limitations of current interpretability methods for safety-critical use cases.
Findings
Feature steering is sensitive to layer selection, steering magnitude, and context.
Activation behaviors are non-standard and complex.
Current methods lack systematic reliability for safety.
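To make "feature steering" and its dependence on steering magnitude concrete, here is a minimal sketch in NumPy. It assumes one common formulation, in which a scaled, unit-norm SAE decoder direction is added to a single activation vector; this is not necessarily the authors' exact setup. The weights are random toy values, and the names `sae_encode`, `steer`, `feature_idx`, and `alpha` are hypothetical, chosen for illustration only.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Feature activations of a toy SAE: ReLU(W_enc @ x + b_enc)."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

def steer(x, W_dec, feature_idx, alpha):
    """Add alpha times the unit-norm decoder direction of one feature to x."""
    direction = W_dec[:, feature_idx]
    direction = direction / np.linalg.norm(direction)
    return x + alpha * direction

# Toy sizes and random weights (assumptions; not taken from any real model or SAE).
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
W_enc = rng.normal(size=(n_features, d_model))
W_dec = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)

x = rng.normal(size=d_model)  # stand-in for one activation vector at some layer

# Steer the same activation with several magnitudes and re-encode it.
steered = {alpha: steer(x, W_dec, feature_idx=3, alpha=alpha)
           for alpha in (0.0, 2.0, 8.0)}
features = {alpha: sae_encode(v, W_enc, b_enc) for alpha, v in steered.items()}
```

In a real model the steered vector would replace the activation at a chosen layer during the forward pass, so both the layer and the value of `alpha` are free parameters. That is the kind of sensitivity the findings above refer to.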
Abstract
Recent work by Anthropic on mechanistic interpretability claims to understand and control Large Language Models by extracting human-interpretable features from their neural activation patterns using sparse autoencoders (SAEs). If successful, this approach offers one of the most promising routes for human oversight in AI safety. We conduct an initial stress-test of these claims by replicating their main results with open-source SAEs for Llama 3.1. While we successfully reproduce basic feature extraction and steering capabilities, our investigation suggests that major caution is warranted regarding the generalizability of these claims. We find that feature steering exhibits substantial fragility, with sensitivity to layer selection, steering magnitude, and context. We observe non-standard activation behavior and demonstrate the difficulty of distinguishing thematically similar features from…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis
