Provably Extracting the Features from a General Superposition
Allen Liu

TL;DR
This paper presents an efficient algorithm for extracting features from complex superpositions in high-dimensional functions, advancing interpretability in machine learning models.
Contribution
It introduces a novel query-based method capable of recovering feature directions in overcomplete regimes with arbitrary response functions.
Findings
The algorithm identifies feature directions even in noisy settings.
Works in more general superposition scenarios than prior methods.
Handles arbitrary response functions with non-degenerate features.
Abstract
It is widely believed that complex machine learning models generally encode features through linear representations. This is the foundational hypothesis behind a vast body of work on interpretability. A key challenge toward extracting interpretable features, however, is that they exist in superposition. In this work, we study the question of extracting features in superposition from a learning theoretic perspective. We start with the following fundamental setting: we are given query access to a function \[ f(x)=\sum_{i=1}^n \sigma_i(v_i^\top x), \] where each unit vector \(v_i\) encodes a feature direction and \(\sigma_i\) is an arbitrary response function, and our goal is to recover the \(v_i\) and the function \(f\). In learning-theoretic terms, superposition refers to the \emph{overcomplete regime}, when the number of features is larger than the underlying dimension \(d\) (i.e. \(n > d\)),…
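To make the query-access setting in the abstract concrete, here is a minimal Python sketch of a black-box oracle for a function of the form f(x) = sum_i sigma_i(v_i^T x) in the overcomplete regime. It illustrates only the problem setup, not the paper's recovery algorithm. The helper names (`random_unit_vector`, `make_query_oracle`), the dimensions, and the particular response functions are all assumptions chosen for illustration.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a direction from the unit sphere in R^d by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]


def make_query_oracle(features, responses):
    """Return f(x) = sum_i sigma_i(v_i . x) as a black-box callable."""
    def f(x):
        total = 0.0
        for v, sigma in zip(features, responses):
            projection = sum(vj * xj for vj, xj in zip(v, x))
            total += sigma(projection)
        return total
    return f


rng = random.Random(0)
d, n = 3, 5  # overcomplete regime: more features than dimensions (n > d)
features = [random_unit_vector(d, rng) for _ in range(n)]

# Hypothetical response functions, one per feature, chosen only for illustration.
responses = [
    math.tanh,
    math.sin,
    lambda t: max(t, 0.0),
    lambda t: t * t,
    math.atan,
]

f = make_query_oracle(features, responses)

# A learner in this setting sees only the values f(x) at points it chooses.
x = [0.5, -1.0, 2.0]
value = f(x)
```

In this sketch the learner would interact only with `f`; the lists `features` and `responses` play the role of the hidden quantities that the recovery problem asks for.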
