Concept Bottleneck Large Language Models

Chung-En Sun; Tuomas Oikarinen; Berk Ustun; Tsui-Wei Weng

arXiv:2412.07992 · cs.CL · September 9, 2025 · 2 citations

TL;DR

Concept Bottleneck Large Language Models (CB-LLMs) introduce intrinsic interpretability into LLMs, enabling transparent reasoning, safer outputs, and improved control over text generation and classification tasks.

Contribution

This paper presents the first framework for inherently interpretable LLMs through concept bottlenecks, enhancing transparency and safety in NLP applications.

Findings

01

CB-LLMs are competitive with, and at times outperform, traditional black-box models in text classification.

02

Interpretable neurons enable precise concept detection and controlled generation.

03

CB-LLMs improve safety and trustworthiness by allowing transparent identification of harmful content.
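
The controlled-generation finding can be pictured with a toy example. The sketch below is an illustration only, not the authors' implementation: it assumes the next-token logits are a linear function of a small vector of concept-neuron activations, and every name in it (`CONCEPTS`, `VOCAB`, `W`, `steer`) is hypothetical. Overwriting one activation changes which token is most likely.

```python
# Hypothetical toy setup: 2 concept neurons and a 3-word vocabulary.
CONCEPTS = ["sports", "politics"]
VOCAB = ["goal", "vote", "the"]

# Assumed final-layer weights: one row per token, one column per concept.
W = [
    [2.0, -1.0],   # "goal" is pushed up by the sports concept
    [-1.0, 2.0],   # "vote" is pushed up by the politics concept
    [0.1, 0.1],    # "the" is roughly concept-neutral
]


def token_logits(concept_acts):
    """Linear map from concept activations to next-token logits."""
    return [sum(w * a for w, a in zip(row, concept_acts)) for row in W]


def steer(concept_acts, concept, value):
    """Return a copy of the activations with one concept neuron overwritten."""
    steered = list(concept_acts)
    steered[CONCEPTS.index(concept)] = value
    return steered


def top_token(concept_acts):
    """Pick the highest-scoring token under the current activations."""
    logits = token_logits(concept_acts)
    return VOCAB[logits.index(max(logits))]


acts = [0.2, 1.0]                               # activations lean towards "politics"
before = top_token(acts)                        # most likely token before steering
after = top_token(steer(acts, "sports", 5.0))   # most likely token after steering
```

In this toy, reading `acts` is the detection side (which concepts are active) and `steer` is the control side. A real model would have far more neurons and a nonlinear backbone in front of them.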

Abstract

We introduce Concept Bottleneck Large Language Models (CB-LLMs), a novel framework for building inherently interpretable Large Language Models (LLMs). In contrast to traditional black-box LLMs that rely on limited post-hoc interpretations, CB-LLMs integrate intrinsic interpretability directly into the LLMs -- allowing accurate explanations with scalability and transparency. We build CB-LLMs for two essential NLP tasks: text classification and text generation. In text classification, CB-LLMs are competitive with, and at times outperform, traditional black-box models while providing explicit and interpretable reasoning. For the more challenging task of text generation, interpretable neurons in CB-LLMs enable precise concept detection, controlled generation, and safer outputs. The embedded interpretability empowers users to transparently identify harmful content, steer model behavior, and…
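
As a rough picture of what "explicit and interpretable reasoning" could look like in classification, here is a minimal sketch, assuming the prediction is a linear function of named concept activations. The concept names, labels, and weights are invented for illustration and do not come from the paper.

```python
# Hypothetical concept names and labels for a sentiment task.
CONCEPTS = ["great acting", "boring plot", "beautiful visuals"]
LABELS = ["negative", "positive"]

# Assumed final linear layer: one row per label, one column per concept.
W = [
    [-1.0, 2.0, -0.5],  # negative
    [1.5, -1.0, 1.0],   # positive
]


def predict_with_explanation(concept_acts):
    """Return the predicted label and per-concept contributions to it."""
    scores = [sum(w * a for w, a in zip(row, concept_acts)) for row in W]
    best = scores.index(max(scores))
    # Each concept's contribution is its weight times its activation.
    contributions = sorted(
        ((c, W[best][j] * concept_acts[j]) for j, c in enumerate(CONCEPTS)),
        key=lambda item: item[1],
        reverse=True,
    )
    return LABELS[best], contributions


# Concept activations that a backbone model might produce for one review.
label, explanation = predict_with_explanation([0.9, 0.1, 0.6])
```

Because the label depends on the input only through the concept activations, the sorted `explanation` list is a complete account of the decision in this toy, which is the property a bottleneck layer is meant to provide.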

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 3

Strengths

1. CB-LLM demonstrates outstanding performance in text classification tasks. After the introduction of Automatic Concept Correction (ACC), the model surpasses traditional black-box models on several datasets, showcasing its strong potential.
2. During the training process for text classification, CB-LLM effectively removes negative activation values using the ReLU activation function, significantly reducing semantic ambiguity during training and enhancing model stability.
3. The authors provide
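
To illustrate the ReLU point raised in this review, here is a minimal sketch with invented concept names and weights. The reading of "semantic ambiguity" is an assumption: a negative activation multiplied by a positive weight yields a negative contribution, which is hard to interpret for a concept that is simply absent.

```python
def relu(x):
    """Clamp negative activations to zero."""
    return max(0.0, x)


# Hypothetical raw concept activations for one input.
raw_acts = {"great acting": 1.2, "boring plot": -0.7, "beautiful visuals": 0.0}

# Assumed weights from each concept to the "negative" label.
weights_to_negative = {"great acting": -1.0, "boring plot": 2.0, "beautiful visuals": -0.5}


def contributions(acts):
    """Per-concept contribution to the 'negative' score: weight times activation."""
    return {c: weights_to_negative[c] * a for c, a in acts.items()}


without_relu = contributions(raw_acts)
with_relu = contributions({c: relu(a) for c, a in raw_acts.items()})
```

Without ReLU, "boring plot" contributes a negative amount to the "negative" label even though the concept is not present. With ReLU, its contribution is zero, so only concepts that are actually active appear in the explanation.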

Weaknesses

1. In the CB-LLM for text classification, the introduction of sparsity constraints during training is not adequately explained. If this constraint aims to enhance the model's interpretability, the authors should provide more analysis and experimental evidence to support this claim.
2. The authors used only RoBERTa and GPT-2 as backbone models in their experiments. Given that the main goal of the paper is to improve the internal interpretability of large language models (LLMs) in text classifica
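
The review does not say which form the sparsity constraint takes. One common choice, shown here purely as an assumption, is an L1 penalty on the final-layer weights, whose proximal step (soft-thresholding) drives small weights to exactly zero. All names and numbers below are hypothetical.

```python
def l1_penalty(weights, lam):
    """L1 regularisation term: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for row in weights for w in row)


def soft_threshold(w, t):
    """Proximal step for the L1 penalty: shrink towards zero by t, clip at zero."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


# Hypothetical concept-to-label weights (rows = labels, columns = concepts).
W = [
    [1.5, 0.05, -0.02],
    [-0.03, 2.0, 0.08],
]

penalty = l1_penalty(W, lam=0.1)
sparse_W = [[soft_threshold(w, 0.1) for w in row] for row in W]
nonzero = sum(1 for row in sparse_W for w in row if w != 0.0)
```

With fewer nonzero weights, each label depends on only a handful of concepts, which is the usual argument for linking sparsity to interpretability. Whether that argument holds for CB-LLMs is the question the reviewer raises.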

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The paper presents a novel method for efficient interpretation in text classification tasks and proposes the first method for interpretation in text generation tasks.
- Both methods have somewhat practical applications for real-world tasks.
- The paper is easy to follow and includes rich figures and examples.

Weaknesses

1. There is limited discussion about the generalizability of interpretable layers across various domains or tasks, specifically regarding whether concepts learned in classification or generation from one dataset can be reused in another. This lack of focus on transfer learning limits the model's broader applicability to related but unseen tasks. Moreover, the method is only tested on LLaMA-3-8B, so it is also important to investigate whether it consistently works across different model families

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The text generation part is simple but interesting.

Weaknesses

The first part of the paper presents how CBL is used in text classification. I found this part unsatisfying.
* The core idea actually is to learn a mapping between concepts generated by GPT and the label set, and thus it seems to me that this part can be explained easily in a much shorter text. Especially, the whole idea of step 3 (correction), written in more than half a page, is simply to zero out those concepts that aren't generated by GPT given a target label -- I believe we just need a par
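
The correction step, as this reviewer summarises it, can be sketched in a few lines. This follows the reviewer's one-sentence description only and is not taken from the paper; the concept-to-label mapping and the scores are invented for illustration.

```python
# Hypothetical mapping from each generated concept to the label it was generated for.
CONCEPT_TO_LABEL = {
    "great acting": "positive",
    "boring plot": "negative",
    "beautiful visuals": "positive",
}


def correct_concept_scores(scores, target_label):
    """Zero out scores of concepts that were not generated for the target label."""
    return {
        concept: (score if CONCEPT_TO_LABEL[concept] == target_label else 0.0)
        for concept, score in scores.items()
    }


# Noisy concept scores for a training example whose label is "positive".
noisy = {"great acting": 0.8, "boring plot": 0.4, "beautiful visuals": 0.3}
corrected = correct_concept_scores(noisy, "positive")
```

Here the score for "boring plot" is treated as noise because that concept belongs to the other label, so it is set to zero before the concept layer is trained on these targets.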

Code & Models

Repositories

trustworthy-ml-lab/cb-llms
PyTorch · Official

Videos

Concept Bottleneck Large Language Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Bayesian Modeling and Causal Inference · Data Quality and Management