Concept Layers: Enhancing Interpretability and Intervenability via LLM Conceptualization
Or Raphael Bidusa, Shaul Markovitch

TL;DR
This paper introduces Concept Layers, a novel method to improve interpretability and intervenability of LLMs by projecting internal representations into an explainable space and automatically selecting relevant concepts, without extensive architectural changes.
Contribution
The work proposes Concept Layers that integrate into existing models, eliminating the need for labeled concept datasets and enabling dynamic user interventions.
Findings
Maintains the original model's performance and agreement with its predictions.
Enables meaningful model interventions.
Supports bias mitigation during inference.
Abstract
The opaque nature of Large Language Models (LLMs) has led to significant research efforts aimed at enhancing their interpretability, primarily through post-hoc methods. More recent in-hoc approaches, such as Concept Bottleneck Models (CBMs), offer both interpretability and intervenability by incorporating explicit concept representations. However, these methods suffer from key limitations, including reliance on labeled concept datasets and significant architectural modifications that challenge re-integration into existing system pipelines. In this work, we introduce a new methodology for incorporating interpretability and intervenability into an existing model by integrating Concept Layers (CLs) into its architecture. Our approach projects the model's internal vector representations into a conceptual, explainable vector space before reconstructing and feeding them back into the model.…
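The project-then-reconstruct step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `concept_layer`, the `intervene` callback, and the assumption that concept directions are orthonormal (so reconstruction reduces to a weighted sum of those directions) are all hypothetical simplifications chosen to keep the example dependency-free.

```python
def concept_layer(hidden, concepts, intervene=None):
    """Project `hidden` onto concept directions, optionally edit the
    concept scores, then reconstruct a vector in the original space.

    ASSUMPTION: the rows of `concepts` are orthonormal. This is an
    illustrative simplification, not the construction used in the paper.
    """
    # Concept scores: dot product of the hidden vector with each concept direction.
    scores = [sum(c_i * h_i for c_i, h_i in zip(c, hidden)) for c in concepts]

    # Optional intervention: a caller-supplied function that edits the scores.
    if intervene is not None:
        scores = intervene(scores)

    # Reconstruction: sum of concept directions weighted by their scores.
    dim = len(hidden)
    reconstructed = [
        sum(s * c[j] for s, c in zip(scores, concepts)) for j in range(dim)
    ]
    return scores, reconstructed


# Two hypothetical concept directions in a 3-dimensional hidden space.
concepts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
hidden = [2.0, 3.0, 5.0]

scores, rebuilt = concept_layer(hidden, concepts)

# Intervention: suppress the first concept before reconstructing.
_, edited = concept_layer(hidden, concepts, intervene=lambda s: [0.0] + s[1:])
```

In this toy setup, the part of `hidden` that lies outside the span of the concept directions is dropped on reconstruction, which is why the choice of concepts matters for preserving the original model's behavior. The intervention shows how editing a score in the concept space changes the vector that would be fed back into the model.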
Taxonomy
Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis · Service-Oriented Architecture and Web Services
Methods: Ontology · Sparse Evolutionary Training
