Learning Concept Bottleneck Models from Mechanistic Explanations

Antonio De Santis; Schrasing Tong; Marco Brambilla; Lalana Kagal

arXiv:2603.07343 · cs.LG · March 10, 2026


Open Access · 3 Reviews

TL;DR

This paper introduces Mechanistic CBMs, a new approach that constructs interpretable concept bottlenecks directly from a black-box model’s learned concepts, improving interpretability and predictive power.

Contribution

We propose a novel CBM pipeline that extracts concepts from a black-box model using autoencoders and LLMs, enhancing interpretability and performance over prior methods.

Findings

1. M-CBMs outperform prior CBMs at matched sparsity
2. Improved concept prediction accuracy
3. Provides concise, interpretable explanations

Abstract

Concept Bottleneck Models (CBMs) aim for ante-hoc interpretability by learning a bottleneck layer that predicts interpretable concepts before the decision. State-of-the-art approaches typically select which concepts to learn via human specification, open knowledge graphs, prompting an LLM, or using general CLIP concepts. However, concepts defined a priori may not have sufficient predictive power for the task, or may not even be learnable from the available data. As a result, these CBMs often significantly trail their black-box counterparts when controlling for information leakage. To address this, we introduce a novel CBM pipeline named Mechanistic CBM (M-CBM), which builds the bottleneck directly from a black-box model's own learned concepts. These concepts are extracted via Sparse Autoencoders (SAEs) and subsequently named and annotated on a selected subset of images using a Multimodal LLM. For…
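To make the SAE step in the abstract concrete, here is a minimal sketch of a sparse autoencoder forward pass and loss over backbone activations. It assumes a generic ReLU encoder with an L1 sparsity penalty; the paper may use a different SAE variant. All names, shapes, and the `l1_coeff` value are hypothetical and chosen only for illustration.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative codes, then reconstruct them."""
    # ReLU keeps codes non-negative; most entries should be zero after training.
    z = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    x_hat = z @ W_dec + b_dec
    return z, x_hat

def sae_loss(x, z, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse codes."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return recon + l1_coeff * sparsity

# Hypothetical sizes: batch of 8 images, 16-dim backbone features, 64 SAE units.
rng = np.random.default_rng(0)
n, d, k = 8, 16, 64
x = rng.normal(size=(n, d))            # stand-in for black-box activations
W_enc = rng.normal(scale=0.1, size=(d, k))
b_enc = np.zeros(k)
W_dec = rng.normal(scale=0.1, size=(k, d))
b_dec = np.zeros(d)

z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, z, x_hat)
```

In a pipeline like the one described, each column of `z` would be a candidate concept, to be named and annotated afterwards; that naming step is not sketched here.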

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. Clear, modular pipeline with concrete engineering choices. Training of the CBMs is standard and well-described.
2. NCC formalizes decision-level sparsity (concept contribution × weight) and can be computed per image/class.
3. Results tables/curves are easy to read.
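As an illustration of decision-level sparsity, the sketch below computes per-concept contributions (concept activation × class weight) for one image and one class, then counts how many concepts are needed to cover a share `tau` of the total absolute contribution. The paper's exact NCC definition is not reproduced on this page, so the cumulative-share rule and the function names are assumptions, not the authors' metric.

```python
def contributions(concept_acts, class_weights):
    """Per-concept contribution to one class logit: activation times weight."""
    return [a * w for a, w in zip(concept_acts, class_weights)]

def concepts_needed(contribs, tau):
    """Smallest number of concepts whose absolute contributions reach a
    tau share of the total (assumed reading, not the paper's definition)."""
    mags = sorted((abs(c) for c in contribs), reverse=True)
    total = sum(mags)
    if total == 0:
        return 0
    running = 0.0
    for count, m in enumerate(mags, start=1):
        running += m
        if running >= tau * total:
            return count
    return len(mags)

# Hypothetical values for a single image and a single class.
acts = [0.9, 0.1, 0.8, 0.0]
weights = [2.0, 1.0, -0.5, 3.0]
contribs = contributions(acts, weights)
n_needed = concepts_needed(contribs, tau=0.9)
```

A smaller count means the decision rests on fewer concepts, which is the sense of sparsity the review refers to.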

Weaknesses

1. Methodological leakage / evaluation contamination. The paper explicitly annotates concepts on 20–30% of the test set (per concept) “solely for a final CBL evaluation.” But concept labels on test are produced using the same pipeline (SAE activation pre-selection + MLLM conditioned on highly activating images). This risks circularity and optimistic estimates of “concept learnability/consistency,” since test annotations are themselves guided by backbone activations and neuron saliency that were

Reviewer 02 · Rating 6 · Confidence 4

Strengths

**The baselines are recognized in the field and strong-performing.** The authors conduct their comparison in a simple and rigorous manner, evaluating the same backbone encoder for each model on the same dataset. Because some datasets require a broader concept set to perform well, the authors employ larger backbone models for all baselines, e.g., on ImageNet.

**The idea is sound and well-presented.** I like how the authors described their framework in Sec. 3. Effort in proposing a

Weaknesses

**Your NCC metric.** I do appreciate the introduced metric as an extension of the NEC metric from the prior work on VLG-CBM. As far as I understand, the parameter $\tau$ controls, let us say, the "sparsity" of the impactful concepts. That is, a model with a smaller NCC (and also a smaller NEC) provides a more explainable decision, since only a few of the most important concepts are activated. So it is naturally good to keep both metrics relatively small or, conversely, a large NEC (NCC) can

Reviewer 03 · Rating 8 · Confidence 3

Strengths

- The methodology is intuitive, and its benefits are evident: leveraging concepts already present in the black-box model enhances interpretability.
- The approach achieves performance improvements over baselines when controlling with NCC.
- The paper is clearly written, and most aspects of the methodology are well justified (with the exception of the weaknesses noted below).

Weaknesses

While the motivation for NCC is intuitively explained (lines 303–305), it would be valuable to provide evidence for why the additional concepts are necessary. When concepts contribute substantially less than the dominant ones that NEC would capture, are they still useful and interpretable, or do they risk overfitting? NCC appears to strike a balance between interpretability and overfitting, but it is unclear how much the method benefits from this added flexibility. Would the same performance pat


Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Advanced Graph Neural Networks