Sparse Concept Anchoring for Interpretable and Controllable Neural Representations
Sandy Fraser, Patryk Wielopolski

TL;DR
Sparse Concept Anchoring biases neural latent spaces to isolate specific concepts with minimal supervision, enabling interpretable and controllable behaviors such as concept removal and behavioral steering.
Contribution
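The paper introduces a method for anchoring a targeted subset of concepts to predefined directions or axis-aligned subspaces of a network's latent space, using labels for fewer than 0.1% of examples per anchored concept, which makes the learned features easier to interpret and control.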
Findings
Selective attenuation of targeted concepts with negligible impact on orthogonal features.
Complete concept removal with reconstruction error approaching the theoretical minimum.
Support for both reversible interventions (inference-time steering) and permanent ones (targeted weight ablation).
Abstract
We introduce Sparse Concept Anchoring, a method that biases latent space to position a targeted subset of concepts while allowing others to self-organize, using only minimal supervision (labels for <0.1% of examples per anchored concept). Training combines activation normalization, a separation regularizer, and anchor or subspace regularizers that attract rare labeled examples to predefined directions or axis-aligned subspaces. The anchored geometry enables two practical interventions: reversible behavioral steering that projects out a concept's latent component at inference, and permanent removal via targeted weight ablation of anchored dimensions. Experiments on structured autoencoders show selective attenuation of targeted concepts with negligible impact on orthogonal features, and complete elimination with reconstruction error approaching theoretical bounds. Sparse Concept Anchoring…
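The mechanisms named in the abstract can be illustrated with a small NumPy sketch. It is a toy illustration under stated assumptions, not the authors' implementation: the cosine form of the anchor loss, the `strength` parameter, the decoder layout `(latent_dim, out_dim)`, and all function names are hypothetical. The separation regularizer is left out because this summary does not specify its form.

```python
import numpy as np


def normalize(z, eps=1e-8):
    """Scale each latent vector to unit length (one possible activation normalization)."""
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)


def anchor_loss(z, labeled_mask, anchor):
    """Assumed anchor regularizer: mean (1 - cosine similarity) between the
    rare labeled latents and a predefined anchor direction."""
    if not labeled_mask.any():
        return 0.0  # no labeled examples in this batch
    a = anchor / np.linalg.norm(anchor)
    cos = normalize(z[labeled_mask]) @ a
    return float(np.mean(1.0 - cos))


def project_out(z, anchor, strength=1.0):
    """Reversible steering: subtract the component of z along the anchor.
    strength=1.0 removes it fully; smaller values only attenuate it."""
    a = anchor / np.linalg.norm(anchor)
    return z - strength * np.outer(z @ a, a)


def ablate_dims(W_dec, dims):
    """Permanent removal (assumed form): zero the decoder rows that read from
    the anchored latent dimensions. W_dec has shape (latent_dim, out_dim)."""
    W = W_dec.copy()
    W[dims, :] = 0.0
    return W


# Toy usage: the concept is anchored to latent axis 0.
rng = np.random.default_rng(0)
z = rng.normal(size=(5, 4))            # batch of latents
W_dec = rng.normal(size=(4, 3))        # linear decoder layer
anchor = np.array([1.0, 0.0, 0.0, 0.0])

z_steered = project_out(z, anchor)     # inference-time intervention
W_ablated = ablate_dims(W_dec, [0])    # weight-level intervention
```

In this linear toy case with an axis-aligned anchor, the two interventions produce the same decoder output: `z_steered @ W_dec` equals `z @ W_ablated`. The difference is that steering leaves the weights intact, so it can be switched off, whereas ablation changes the model itself.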
