Sparse Concept Anchoring for Interpretable and Controllable Neural Representations

Sandy Fraser; Patryk Wielopolski

arXiv:2512.12469·cs.LG·April 28, 2026

TL;DR

Sparse Concept Anchoring biases neural latent spaces to isolate specific concepts with minimal supervision, enabling interpretable and controllable behaviors such as concept removal and behavioral steering.

Contribution

The paper introduces a method that anchors a targeted subset of concepts to predefined directions or axis-aligned subspaces in latent space, using labels for fewer than 0.1% of examples per anchored concept, so that the learned features are easier to interpret and control.

Findings

01

Selective attenuation of targeted concepts with negligible impact on others.

02

Complete concept removal with near-theoretical minimum reconstruction error.

03

Both reversible interventions (inference-time steering) and permanent ones (weight ablation) on the anchored representation.

Abstract

We introduce Sparse Concept Anchoring, a method that biases latent space to position a targeted subset of concepts while allowing others to self-organize, using only minimal supervision (labels for <0.1% of examples per anchored concept). Training combines activation normalization, a separation regularizer, and anchor or subspace regularizers that attract rare labeled examples to predefined directions or axis-aligned subspaces. The anchored geometry enables two practical interventions: reversible behavioral steering that projects out a concept's latent component at inference, and permanent removal via targeted weight ablation of anchored dimensions. Experiments on structured autoencoders show selective attenuation of targeted concepts with negligible impact on orthogonal features, and complete elimination with reconstruction error approaching theoretical bounds. Sparse Concept Anchoring…
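
The two interventions named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single anchored concept aligned with latent axis 0 and a purely linear decoder, and the names `project_out`, `ablate_dimension`, `decode`, `anchor`, and `W_dec` are hypothetical, chosen only for this example.

```python
import math

def project_out(z, direction):
    """Reversible steering: remove the component of z along a direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(zi * ui for zi, ui in zip(z, unit))
    return [zi - coeff * ui for zi, ui in zip(z, unit)]

def ablate_dimension(weight, dim):
    """Permanent removal: zero the decoder column that reads latent `dim`.

    `weight` is a list of rows, one per output unit, one column per latent dim.
    """
    return [[0.0 if j == dim else w for j, w in enumerate(row)] for row in weight]

def decode(z, weight):
    """Hypothetical linear decoder: one dot product per output unit."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in weight]

# Assumed setup: the concept is anchored to axis 0 of a 3-dimensional latent.
anchor = [1.0, 0.0, 0.0]
z = [0.8, -0.3, 0.5]
W_dec = [[0.2, 0.4, -0.1],
         [0.7, -0.5, 0.3]]

z_steered = project_out(z, anchor)       # applied at inference, weights untouched
W_ablated = ablate_dimension(W_dec, 0)   # applied once to the weights

out_steered = decode(z_steered, W_dec)
out_ablated = decode(z, W_ablated)
```

Under these assumptions the two routes give the same output for an axis-aligned anchor. The difference is that projection can be switched off again at inference, while ablation changes the stored weights.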
