A framework for analyzing concept representations in neural models
Burin Naowarat, Hao Tang, Sharon Goldwater

TL;DR
This paper introduces a unified framework to analyze how neural models represent concepts, focusing on containment and disentanglement, and evaluates different estimators and models in text and speech domains.
Contribution
It proposes a unified framework for analyzing linear concept subspaces in neural models and compares five estimators, showing that the choice of estimator affects the containment and disentanglement properties of the estimated subspaces.
Findings
Concept subspaces are not always uniquely determined.
Estimator choice affects containment and disentanglement properties.
LEACE performs well but struggles with unseen data.
Abstract
Understanding how neural models represent human-interpretable concepts is challenging. Prior work has explored linear concept subspaces from diverse perspectives, such as probing and concept erasure. We introduce a unified framework to study these subspaces along two axes: containment, which tests if a concept is fully represented in a subspace but not outside it, and disentanglement, which tests for isolation from other concepts. In experiments on both text and speech models, we first highlight that concept subspaces may not be uniquely determined, and discuss the implications for concept subspace analysis. Then, we compare properties of concept subspaces estimated using five estimators, proposed in different communities. We find that (1) the choice of estimator impacts the containment and disentanglement properties; (2) the state-of-the-art concept erasure method,…
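To make the two axes concrete, here is a minimal sketch on synthetic data. It is an illustration under stated assumptions, not the paper's method. The estimator (unit vector between class means), the nearest-centroid probe, the toy data, and all function names are hypothetical choices made for this example, and the estimator is not claimed to be one of the five the paper compares. Containment is approximated by probing for a concept inside the estimated subspace and in its orthogonal complement. Disentanglement is approximated by probing for a second, independent concept inside the first concept's subspace.

```python
import numpy as np


def mean_difference_subspace(X, y):
    """Assumed estimator: 1-D subspace spanned by the difference of class means."""
    direction = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return direction / np.linalg.norm(direction)


def split_by_subspace(X, u):
    """Return the part of X inside span(u) and the part in its orthogonal complement."""
    inside = np.outer(X @ u, u)
    return inside, X - inside


def centroid_probe_accuracy(X_train, y_train, X_test, y_test):
    """Assumed probe: predict the class whose training centroid is nearest."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred == y_test).mean())


def make_data(n, dim, rng):
    """Toy representations: concept y lives on axis 0, independent concept z on axis 1."""
    y = rng.integers(0, 2, size=n)
    z = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, dim))
    X[:, 0] += 4.0 * (2 * y - 1)
    X[:, 1] += 4.0 * (2 * z - 1)
    return X, y, z


rng = np.random.default_rng(0)
X_train, y_train, z_train = make_data(2000, 16, rng)
X_test, y_test, z_test = make_data(2000, 16, rng)

# Estimate the subspace for concept y on training data only.
u = mean_difference_subspace(X_train, y_train)
in_train, out_train = split_by_subspace(X_train, u)
in_test, out_test = split_by_subspace(X_test, u)

# Containment: y should be decodable inside the subspace but not outside it.
acc_inside = centroid_probe_accuracy(in_train, y_train, in_test, y_test)
acc_outside = centroid_probe_accuracy(out_train, y_train, out_test, y_test)

# Disentanglement: the other concept z should not be decodable inside y's subspace.
acc_other_inside = centroid_probe_accuracy(in_train, z_train, in_test, z_test)

print(f"y inside subspace:  {acc_inside:.3f}")
print(f"y outside subspace: {acc_outside:.3f}")
print(f"z inside subspace:  {acc_other_inside:.3f}")
```

In this toy setting the concept is linearly encoded along a single axis by construction, so a mean-difference direction recovers it almost exactly. Real model representations need not behave this way, which is part of why the choice of estimator and evaluation on held-out data matter.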
