Concept Layers: Enhancing Interpretability and Intervenability via LLM Conceptualization

Or Raphael Bidusa; Shaul Markovitch

arXiv:2502.13632·cs.LG·February 20, 2025

Open Access

TL;DR

This paper introduces Concept Layers, a novel method to improve interpretability and intervenability of LLMs by projecting internal representations into an explainable space and automatically selecting relevant concepts, without extensive architectural changes.

Contribution

The work proposes Concept Layers that integrate into existing models, eliminating the need for labeled concept datasets and enabling dynamic user interventions.

Findings

01. Maintains the original model's performance and agreement with its predictions.

02. Enables meaningful model interventions.

03. Supports bias mitigation during inference.

Abstract

The opaque nature of Large Language Models (LLMs) has led to significant research efforts aimed at enhancing their interpretability, primarily through post-hoc methods. More recent in-hoc approaches, such as Concept Bottleneck Models (CBMs), offer both interpretability and intervenability by incorporating explicit concept representations. However, these methods suffer from key limitations, including reliance on labeled concept datasets and significant architectural modifications that challenge re-integration into existing system pipelines. In this work, we introduce a new methodology for incorporating interpretability and intervenability into an existing model by integrating Concept Layers (CLs) into its architecture. Our approach projects the model's internal vector representations into a conceptual, explainable vector space before reconstructing and feeding them back into the model.…
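The project-then-reconstruct step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`project`, `reconstruct`, `concept_layer`), the `overrides` parameter, and the toy vectors are all hypothetical, and the sketch assumes orthonormal concept vectors so that reconstruction is a plain weighted sum. The abstract does not state how the paper performs projection or reconstruction.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project(hidden: Vector, concepts: List[Vector]) -> Vector:
    """Map a hidden vector to one activation per concept (assumed: dot product)."""
    return [dot(hidden, c) for c in concepts]


def reconstruct(activations: Vector, concepts: List[Vector]) -> Vector:
    """Rebuild a hidden-space vector as a weighted sum of concept vectors.

    Assumption: concept vectors are orthonormal, so this is the orthogonal
    projection of the original hidden vector onto the concept span.
    """
    dim = len(concepts[0])
    out = [0.0] * dim
    for a, c in zip(activations, concepts):
        for i in range(dim):
            out[i] += a * c[i]
    return out


def concept_layer(
    hidden: Vector,
    concepts: List[Vector],
    overrides: Optional[Dict[int, float]] = None,
) -> Tuple[Vector, Vector]:
    """Project, optionally intervene on chosen concepts, then reconstruct."""
    activations = project(hidden, concepts)
    # Hypothetical intervention: overwrite selected concept activations.
    for idx, value in (overrides or {}).items():
        activations[idx] = value
    return activations, reconstruct(activations, concepts)


# Toy example: two orthonormal concepts in a 3-dimensional hidden space.
concepts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
hidden = [2.0, -1.0, 0.5]

acts, rebuilt = concept_layer(hidden, concepts)
acts_i, rebuilt_i = concept_layer(hidden, concepts, overrides={1: 0.0})
```

In this toy setup the readable activations are `acts`, and setting concept 1 to zero removes its contribution from the vector passed onward. The third hidden coordinate is dropped in both cases because no concept spans it, which illustrates why the choice of concepts affects how much of the original representation survives.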

Taxonomy

Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis · Service-Oriented Architecture and Web Services

Methods: Ontology · Sparse Evolutionary Training