TL;DR
Concept Bottleneck Large Language Models (CB-LLMs) introduce intrinsic interpretability into LLMs, enabling transparent reasoning, safer outputs, and improved control over text generation and classification tasks.
Contribution
This paper presents the first framework for inherently interpretable LLMs through concept bottlenecks, enhancing transparency and safety in NLP applications.
Findings
CB-LLMs match or outperform traditional black-box models in text classification.
Interpretable neurons enable precise concept detection and controlled generation.
CB-LLMs improve safety and trustworthiness by allowing transparent identification of harmful content.
Abstract
We introduce Concept Bottleneck Large Language Models (CB-LLMs), a novel framework for building inherently interpretable Large Language Models (LLMs). In contrast to traditional black-box LLMs that rely on limited post-hoc interpretations, CB-LLMs integrate intrinsic interpretability directly into the LLMs -- allowing accurate explanations with scalability and transparency. We build CB-LLMs for two essential NLP tasks: text classification and text generation. In text classification, CB-LLMs are competitive with, and at times outperform, traditional black-box models while providing explicit and interpretable reasoning. For the more challenging task of text generation, interpretable neurons in CB-LLMs enable precise concept detection, controlled generation, and safer outputs. The embedded interpretability empowers users to transparently identify harmful content, steer model behavior, and…
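As a rough illustration of the concept-bottleneck idea described in the abstract, the sketch below routes a feature vector through one named activation per concept and lets the final layer see only those activations. It is a minimal toy under stated assumptions, not the authors' implementation: the function names, the concept names, and all weights are hypothetical, and the ReLU on concept activations follows a remark in the peer reviews rather than anything stated in the abstract.

```python
from typing import List, Sequence, Tuple


def relu(values: Sequence[float]) -> List[float]:
    """Clamp negative activations to zero."""
    return [max(0.0, v) for v in values]


def matvec(matrix: Sequence[Sequence[float]], vector: Sequence[float]) -> List[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def predict_with_explanation(
    features: Sequence[float],
    concept_weights: Sequence[Sequence[float]],
    class_weights: Sequence[Sequence[float]],
    concept_names: Sequence[str],
) -> Tuple[int, List[Tuple[str, float]]]:
    """Return the predicted class and each concept's contribution to it."""
    # Bottleneck: one non-negative activation per named concept.
    activations = relu(matvec(concept_weights, features))
    # The final layer sees only the concept activations.
    logits = matvec(class_weights, activations)
    predicted = max(range(len(logits)), key=lambda i: logits[i])
    # Contribution of a concept = its final-layer weight times its activation.
    contributions = sorted(
        (
            (name, weight * act)
            for name, weight, act in zip(
                concept_names, class_weights[predicted], activations
            )
        ),
        key=lambda item: item[1],
        reverse=True,
    )
    return predicted, contributions


# Hypothetical concepts and hand-picked weights, for illustration only.
concept_names = ["praises the acting", "complains about the plot", "mentions ticket price"]
features = [1.0, -2.0]            # stand-in for backbone features
concept_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
class_weights = [[2.0, 0.0, 1.0], [0.0, 3.0, 1.0]]

predicted, contributions = predict_with_explanation(
    features, concept_weights, class_weights, concept_names
)
print(predicted, contributions)
```

Because the prediction is a weighted sum of named concept activations, the sorted contribution list doubles as the explanation for the predicted class.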
Peer Reviews
Decision · ICLR 2025 Poster
1. CB-LLM demonstrates outstanding performance in text classification tasks. After the introduction of Automatic Concept Correction (ACC), the model surpasses traditional black-box models on several datasets, showcasing its strong potential.
2. During the training process for text classification, CB-LLM effectively removes negative activation values using the ReLU activation function, significantly reducing semantic ambiguity during training and enhancing model stability.
3. The authors provide
1. In the CB-LLM for text classification, the introduction of sparsity constraints during training is not adequately explained. If this constraint aims to enhance the model's interpretability, the authors should provide more analysis and experimental evidence to support this claim.
2. The authors used only RoBERTa and GPT-2 as backbone models in their experiments. Given that the main goal of the paper is to improve the internal interpretability of large language models (LLMs) in text classifica
- The paper presents a novel method for efficient interpretation in text classification tasks and proposes the first method for interpretation in text generation tasks.
- Both methods have somewhat practical applications for real-world tasks.
- The paper is easy to follow and includes rich figures and examples.
1. There is limited discussion about the generalizability of interpretable layers across various domains or tasks, specifically regarding whether concepts learned in classification or generation from one dataset can be reused in another. This lack of focus on transfer learning limits the model's broader applicability to related but unseen tasks. Moreover, the method is only tested on LLaMA-3-8B, so it is also important to investigate whether it consistently works across different model families
The text generation part is simple but interesting.
The first part of the paper presents how CBL is used in text classification. I found this part unsatisfying.
* The core idea actually is to learn a mapping between concepts generated by GPT and the label set, and thus it seems to me that this part can be explained easily in a much shorter text. Especially, the whole idea of step 3 (correction), written in more than half a page, is simply to zero out those concepts that aren't generated by GPT given a target label -- I believe we just need a par
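The correction step, as this reviewer summarizes it, can be sketched in a few lines. This is a minimal sketch of the reviewer's one-sentence description only (zero out scores for concepts that were not generated for the sample's target label); the paper's actual rule may include further conditions. The function name, the concept-to-label mapping, and the scores are hypothetical.

```python
from typing import List, Sequence


def correct_concept_scores(
    scores: Sequence[float],
    concept_to_label: Sequence[str],
    target_label: str,
) -> List[float]:
    """Keep a concept's score only if the concept belongs to the target label.

    concept_to_label[i] is the label for which concept i was generated
    (a hypothetical mapping, assumed to be known in advance).
    """
    if len(scores) != len(concept_to_label):
        raise ValueError("scores and concept_to_label must have the same length")
    return [
        score if label == target_label else 0.0
        for score, label in zip(scores, concept_to_label)
    ]


# Toy example: three concepts, two generated for "positive", one for "negative".
scores = [0.8, 0.3, 0.6]
concept_to_label = ["positive", "negative", "positive"]
corrected = correct_concept_scores(scores, concept_to_label, "positive")
print(corrected)
```

In this toy run the second concept was generated for a different label than the sample's, so its score is set to zero while the other two are kept.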
Taxonomy
Topics: Topic Modeling · Bayesian Modeling and Causal Inference · Data Quality and Management
