Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
Lucy Farnik, Tim Lawson, Conor Houghton, Laurence Aitchison

TL;DR
This paper introduces Jacobian Sparse Autoencoders (JSAEs), which sparsify not only the activations of large language models but also the computations connecting them, making the models' internal processes easier to understand while maintaining performance.
Contribution
JSAEs are a novel method that efficiently sparsify the Jacobian matrices in LLMs, revealing computational sparsity and improving interpretability over traditional SAEs.
Findings
JSAEs extract significant computational sparsity while preserving model performance.
Jacobian matrices serve as effective proxies for computational sparsity.
Pre-trained LLMs exhibit greater computational sparsity than randomized models.
Abstract
Sparse autoencoders (SAEs) have been successfully used to discover sparse and human-interpretable representations of the latent activations of LLMs. However, we would ultimately like to understand the computations performed by LLMs and not just their representations. The extent to which SAEs can help us understand computations is unclear because they are not designed to "sparsify" computations in any sense, only latent activations. To solve this, we propose Jacobian SAEs (JSAEs), which yield not only sparsity in the input and output activations of a given model component but also sparsity in the computation (formally, the Jacobian) connecting them. With a naïve implementation, the Jacobians in LLMs would be computationally intractable due to their size. One key technical contribution is thus finding an efficient way of computing Jacobians in this setup. We find that JSAEs extract a…
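To make "the Jacobian connecting them" concrete, here is a minimal NumPy sketch of the idea, not the paper's implementation. It assumes a toy ReLU MLP, a linear input-SAE decoder, and a ReLU output-SAE encoder. All weights, sizes, and function names (`latent_to_latent`, `jacobian`, `jacobian_l1`) are hypothetical and chosen only for illustration. Under these assumptions the Jacobian from input latents to output latents has a closed form via the chain rule, so it can be computed with a few matrix products instead of generic automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_lat = 4, 8, 6  # toy sizes, for illustration only

# Hypothetical random weights: input-SAE decoder, MLP, output-SAE encoder.
W_dec = rng.normal(size=(d_model, n_lat))
W1, b1 = rng.normal(size=(d_hidden, d_model)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_model, d_hidden)), rng.normal(size=d_model)
W_enc, b_enc = rng.normal(size=(n_lat, d_model)), rng.normal(size=n_lat)


def relu(v):
    return np.maximum(v, 0.0)


def latent_to_latent(s_in):
    """Map input-SAE latents to output-SAE latents through the MLP."""
    x = W_dec @ s_in                   # decode latents to MLP input
    y = W2 @ relu(W1 @ x + b1) + b2    # toy MLP (assumed ReLU)
    return relu(W_enc @ y + b_enc)     # encode MLP output to latents


def jacobian(s_in):
    """Closed-form Jacobian d s_out / d s_in via the chain rule."""
    x = W_dec @ s_in
    pre_h = W1 @ x + b1
    y = W2 @ relu(pre_h) + b2
    pre_s = W_enc @ y + b_enc
    mask_h = (pre_h > 0).astype(float)  # ReLU derivative in the MLP
    mask_s = (pre_s > 0).astype(float)  # ReLU derivative in the encoder
    return (mask_s[:, None] * W_enc) @ (W2 * mask_h[None, :]) @ W1 @ W_dec


def jacobian_l1(s_in):
    """Mean absolute Jacobian entry: one possible sparsity penalty."""
    return np.abs(jacobian(s_in)).mean()


# Usage: a sparse input latent vector with two active entries.
s = np.array([1.0, 0.0, 0.0, 0.5, 0.0, 0.0])
J = jacobian(s)
print("Jacobian shape:", J.shape)
print("L1 penalty:", jacobian_l1(s))
```

In a training loop, a term like `jacobian_l1` would be added to the two SAE reconstruction losses with some weighting coefficient. The exact form and weighting here are assumptions, not taken from the text above.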
Taxonomy
Topics: Model Reduction and Neural Networks · Advanced Vision and Imaging · Computer Graphics and Visualization Techniques
