TL;DR
This paper introduces ITDA, a fast and resource-efficient method for decomposing large language model activations, enabling cross-model comparisons and interpretability with minimal training costs.
Contribution
ITDA offers a scalable, inference-time decomposition approach that requires significantly less training time and data than traditional SAEs, facilitating broader application and comparison across models.
Findings
ITDA achieves comparable reconstruction performance to SAEs on some models.
ITDA enables effective cross-model comparison using Jaccard similarity.
ITDA is trainable on large models with minimal computational resources.
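As a rough illustration of the Jaccard-based comparison mentioned in the findings, the sketch below scores the overlap between two dictionaries. It assumes, hypothetically, that each model's dictionary can be identified by the set of (prompt index, token index) positions whose activations were selected; the summary above does not spell out the exact representation, and the function name `jaccard_similarity` is illustrative.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity |A & B| / |A | B| between two collections of hashable items."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention chosen here (an assumption): two empty sets count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical dictionaries, each a set of (prompt_index, token_index) pairs.
dict_model_a = {(0, 3), (0, 7), (2, 1), (5, 4)}
dict_model_b = {(0, 3), (2, 1), (4, 9), (5, 4)}

similarity = jaccard_similarity(dict_model_a, dict_model_b)
```

A score of 1 would mean both models selected exactly the same positions, and 0 would mean no overlap.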
Abstract
Sparse autoencoders (SAEs) are a popular method for decomposing large language model (LLM) activations into interpretable latents. However, due to their substantial training cost, most academic research uses open-source SAEs, which are only available for a restricted set of models of up to 27B parameters. SAE latents are also learned from a dataset of activations, which means they do not transfer between models. Motivated by relative representation similarity measures, we introduce Inference-Time Decomposition of Activations (ITDA) models, an alternative method for decomposing language model activations. To train an ITDA, we greedily construct a dictionary of language model activations on a dataset of prompts, selecting those activations which were worst approximated by matching pursuit on the existing dictionary. ITDAs can be trained in just 1% of the time required for SAEs, using 1% of…
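The greedy dictionary construction described in the abstract might look roughly like the following. This is a minimal sketch, not the authors' implementation: it assumes activations arrive as rows of a NumPy array, that "worst approximated" is decided by a relative squared-error threshold (`loss_threshold`), and that matching pursuit uses at most `k` atoms. The names `matching_pursuit`, `build_dictionary`, `k`, and `loss_threshold` are all hypothetical.

```python
import numpy as np


def matching_pursuit(x, atoms, k):
    """Greedily approximate x with at most k picks from unit-norm rows of `atoms`.

    Returns (coefficients, residual).
    """
    residual = np.asarray(x, dtype=float).copy()
    coeffs = np.zeros(atoms.shape[0])
    for _ in range(k):
        scores = atoms @ residual              # projection onto every atom
        j = int(np.argmax(np.abs(scores)))     # best-matching atom
        if scores[j] == 0:
            break                              # nothing left to explain
        coeffs[j] += scores[j]
        residual -= scores[j] * atoms[j]
    return coeffs, residual


def build_dictionary(activations, k=4, loss_threshold=0.1):
    """Add an activation to the dictionary when the current dictionary
    reconstructs it poorly (relative squared error above loss_threshold).

    The threshold rule is an assumption made for this sketch.
    """
    activations = np.asarray(activations, dtype=float)
    atoms, indices = [], []
    for i, x in enumerate(activations):
        norm = np.linalg.norm(x)
        if norm == 0:
            continue
        if atoms:
            _, residual = matching_pursuit(x, np.stack(atoms), k)
            error = np.linalg.norm(residual) ** 2 / norm ** 2
        else:
            error = 1.0                        # empty dictionary explains nothing
        if error > loss_threshold:
            atoms.append(x / norm)             # store as a unit-norm atom
            indices.append(i)
    if not atoms:
        return np.empty((0, activations.shape[1])), indices
    return np.stack(atoms), indices


# Toy usage: the second vector is a multiple of the first, so it is skipped.
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
toy_atoms, toy_indices = build_dictionary(toy, k=2, loss_threshold=0.1)
```

Because the dictionary is built in a single pass over activations with no gradient updates, a procedure of this shape is cheap compared with SAE training. The returned `indices` record which inputs became atoms.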
Taxonomy
Methods: Sparse Evolutionary Training
