Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models

Patrick Leask; Neel Nanda; Noura Al Moubayed

arXiv:2505.17769·cs.LG·June 13, 2025


TL;DR

This paper introduces ITDA, a fast and resource-efficient method for decomposing large language model activations, enabling cross-model comparisons and interpretability with minimal training costs.

Contribution

ITDA offers a scalable, inference-time decomposition approach that requires significantly less training time and data than traditional SAEs, facilitating broader application and comparison across models.

Findings

01

ITDA achieves comparable reconstruction performance to SAEs on some models.

02

ITDA enables effective cross-model comparison using Jaccard similarity.

03

ITDA is trainable on large models with minimal computational resources.
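As a rough illustration of the Jaccard similarity mentioned in finding 02, the sketch below compares two sets of integer IDs. It assumes, as this page does not spell out, that each ITDA dictionary can be identified by the dataset positions of the activations it selected; the function name `jaccard` and the example index lists are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| between two collections of IDs."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical dictionaries, each given as the dataset positions it selected.
dict_model_a = [0, 2, 5]
dict_model_b = [2, 5, 7]
similarity = jaccard(dict_model_a, dict_model_b)  # 2 shared / 4 total = 0.5
```

A score of 1.0 means both dictionaries selected exactly the same positions, and 0.0 means they share none.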

Abstract

Sparse autoencoders (SAEs) are a popular method for decomposing Large Language Model (LLM) activations into interpretable latents. However, due to their substantial training cost, most academic research uses open-source SAEs, which are only available for a restricted set of models of up to 27B parameters. SAE latents are also learned from a dataset of activations, which means they do not transfer between models. Motivated by relative representation similarity measures, we introduce Inference-Time Decomposition of Activations (ITDA) models, an alternative method for decomposing language model activations. To train an ITDA, we greedily construct a dictionary of language model activations on a dataset of prompts, selecting those activations which were worst approximated by matching pursuit on the existing dictionary. ITDAs can be trained in just 1% of the time required for SAEs, using 1% of…
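The greedy dictionary construction described in the abstract can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation: the names `matching_pursuit` and `build_dictionary`, the sparsity budget `k`, and the relative-error `threshold` used to decide when an activation counts as poorly approximated are all assumptions made for this example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def matching_pursuit(x, atoms, k):
    """Greedily approximate x with at most k selections of unit-norm atoms.

    Returns (coeffs, residual), where coeffs maps atom index -> coefficient.
    """
    residual = list(x)
    coeffs = {}
    for _ in range(k):
        if not atoms:
            break
        scores = [dot(residual, atom) for atom in atoms]
        # Pick the atom most correlated with the current residual.
        j = max(range(len(atoms)), key=lambda i: abs(scores[i]))
        if abs(scores[j]) < 1e-12:
            break  # no atom explains any of the remaining residual
        coeffs[j] = coeffs.get(j, 0.0) + scores[j]
        residual = [r - scores[j] * a for r, a in zip(residual, atoms[j])]
    return coeffs, residual


def build_dictionary(activations, k=2, threshold=0.1):
    """Add an activation as a new atom when the existing dictionary
    reconstructs it poorly (relative squared error above `threshold`).

    `threshold` is an assumption of this sketch, not a value from the paper.
    """
    atoms, indices = [], []
    for i, x in enumerate(activations):
        norm_sq = dot(x, x)
        if norm_sq == 0:
            continue  # skip all-zero activations
        _, residual = matching_pursuit(x, atoms, k)
        if dot(residual, residual) / norm_sq > threshold:
            atoms.append(normalize(x))
            indices.append(i)
    return atoms, indices


# Toy 2-D "activations"; real ones would come from a language model layer.
toy = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
atoms, indices = build_dictionary(toy)
```

In this toy run only the first and third vectors are added, since the other two are already reconstructed by the existing atoms. No gradient-based training is involved: the dictionary consists of stored activations, and the decomposition itself happens at inference time.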


Code & Models

Repositories

pleask/itda
PyTorch · Official


Taxonomy

Methods: Sparse Evolutionary Training