Provably Extracting the Features from a General Superposition

Allen Liu

arXiv:2512.15987 · cs.LG · April 1, 2026

TL;DR

This paper presents an efficient query-based algorithm for recovering the feature directions that a function encodes in superposition, with the aim of putting interpretability of machine learning models on a firmer theoretical footing.

Contribution

The paper introduces a query-based method that recovers feature directions in the overcomplete regime (more features than dimensions) with arbitrary response functions.

Findings

01. Algorithm successfully identifies feature directions in noisy settings.

02. Works in more general superposition scenarios than prior methods.

03. Handles arbitrary response functions with non-degenerate features.

Abstract

It is widely believed that complex machine learning models generally encode features through linear representations. This is the foundational hypothesis behind a vast body of work on interpretability. A key challenge toward extracting interpretable features, however, is that they exist in superposition. In this work, we study the question of extracting features in superposition from a learning-theoretic perspective. We start with the following fundamental setting: we are given query access to a function

\[ f(x) = \sum_{i=1}^{n} \sigma_i(v_i^\top x), \]

where each unit vector $v_i$ encodes a feature direction and $\sigma_i : \mathbb{R} \to \mathbb{R}$ is an arbitrary response function, and our goal is to recover the $v_i$ and the function $f$. In learning-theoretic terms, superposition refers to the overcomplete regime, when the number of features is larger than the underlying dimension (i.e., $n > d$),…
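To make the setting in the abstract concrete, here is a minimal Python sketch of a query oracle for $f(x)=\sum_{i=1}^n \sigma_i(v_i^\top x)$ in the overcomplete regime ($n > d$). It only illustrates the problem setup and is not the paper's recovery algorithm. The helper name `make_superposition`, the random choice of directions, and the particular response functions are all assumptions made for illustration.

```python
import numpy as np

def make_superposition(n, d, responses, seed=0):
    """Build a query oracle f(x) = sum_i sigma_i(v_i^T x).

    Hypothetical helper: the directions are drawn at random here,
    which is an illustrative assumption, not part of the paper.
    """
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((n, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # normalize rows to unit vectors v_i

    def f(x):
        z = V @ np.asarray(x, dtype=float)  # projections v_i^T x, shape (n,)
        return float(sum(sigma(z_i) for sigma, z_i in zip(responses, z)))

    return V, f

# Overcomplete example: n = 5 features in d = 3 dimensions.
d, n = 3, 5
# Arbitrary (illustrative) response functions sigma_i : R -> R.
responses = [np.tanh, np.sin, np.abs, np.square, lambda t: max(t, 0.0)]
V, f = make_superposition(n, d, responses)

# A learner in this model sees only values f(x) at points x it chooses.
value = f(np.ones(d))
```

In this sketch the learner's only access to the hidden directions `V` is through calls to `f`, which mirrors the query-access assumption stated in the abstract.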
