Learning Concept Bottleneck Models from Mechanistic Explanations
Antonio De Santis, Schrasing Tong, Marco Brambilla, Lalana Kagal

TL;DR
This paper introduces Mechanistic CBMs, a new approach that constructs interpretable concept bottlenecks directly from a black-box model’s learned concepts, improving interpretability and predictive power.
Contribution
We propose a novel CBM pipeline that extracts concepts from a black-box model using sparse autoencoders and names and annotates them with a multimodal LLM, enhancing interpretability and performance over prior methods.
Findings
M-CBMs outperform prior CBMs at matched sparsity
M-CBMs improve concept prediction accuracy
M-CBMs provide concise, interpretable explanations
Abstract
Concept Bottleneck Models (CBMs) aim for ante-hoc interpretability by learning a bottleneck layer that predicts interpretable concepts before the decision. State-of-the-art approaches typically select which concepts to learn via human specification, open knowledge graphs, prompting an LLM, or using general CLIP concepts. However, concepts defined a priori may not have sufficient predictive power for the task or even be learnable from the available data. As a result, these CBMs often significantly trail their black-box counterpart when controlling for information leakage. To address this, we introduce a novel CBM pipeline named Mechanistic CBM (M-CBM), which builds the bottleneck directly from a black-box model's own learned concepts. These concepts are extracted via Sparse Autoencoders (SAEs) and subsequently named and annotated on a selected subset of images using a Multimodal LLM. For…
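The concept-extraction step mentioned in the abstract can be illustrated with a generic sparse autoencoder. The sketch below is not the paper's implementation: the ReLU encoder, the L1 penalty, the function names `sae_forward` and `sae_loss`, and all dimensions are assumptions chosen for illustration, and random vectors stand in for backbone features.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """One pass through a generic SAE (assumed architecture, not the paper's)."""
    # ReLU keeps the codes non-negative; many of them end up exactly zero.
    z = np.maximum(0.0, x @ W_enc + b_enc)
    # A linear decoder reconstructs the input features from the codes.
    x_hat = z @ W_dec + b_dec
    return z, x_hat

def sae_loss(x, z, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse codes."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return recon + l1_coeff * sparsity

# Hypothetical sizes: 16-dim backbone features, 64 dictionary units, 8 samples.
rng = np.random.default_rng(0)
d_model, d_dict, n = 16, 64, 8
x = rng.normal(size=(n, d_model))  # stand-in for backbone activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, z, x_hat)
```

In a pipeline of the kind the abstract describes, each column of `z` would be a candidate concept to be named and annotated; that step is not shown here.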
Peer Reviews
Decision: ICLR 2026 Poster
1. Clear, modular pipeline with concrete engineering choices. Training of the CBMs is standard and well-described.
2. NCC formalizes decision-level sparsity (concept contribution × weight) and can be computed per image/class.
3. Results tables/curves are easy to read.
1. Methodological leakage / evaluation contamination. The paper explicitly annotates concepts on 20–30% of the test set (per concept) “solely for a final CBL evaluation.” But concept labels on test are produced using the same pipeline (SAE activation pre-selection + MLLM conditioned on highly activating images). This risks circularity and optimistic estimates of “concept learnability/consistency,” since test annotations are themselves guided by backbone activations and neuron saliency that were
**The baselines are recognized in the field and strong-performing.** The authors conduct their comparison in a simple and rigorous manner, evaluating the same backbone encoder for each model on the same dataset. Considering that some datasets require a broader concept set to perform well on, the authors employ larger backbone models for all baselines, e.g., on ImageNet. **The idea is sound and well-presented.** I like how the authors described their framework in Sec. 3. Effort in proposing a
**Your NCC metric.** I do appreciate the introduced metric as an extension of the NEC metric from the prior work on VLG-CBM. As far as I understand, the parameter $\tau$ controls, let us say, the "sparsity" of the impactful concepts. This means that a model with a smaller NCC (and also a smaller NEC) provides a more explainable decision, since only a few of the most important concepts are activated. So it is naturally good to keep both metrics relatively small, or, conversely, a large NEC (NCC) can
- The methodology is intuitive, and its benefits are evident: leveraging concepts already present in the black-box model enhances interpretability.
- The approach achieves performance improvements over baselines when controlling with NCC.
- The paper is clearly written, and most aspects of the methodology are well justified (with the exception of the weaknesses noted below).
While the motivation for NCC is intuitively explained (lines 303–305), it would be valuable to provide evidence for why the additional concepts are necessary. When concepts contribute substantially less than the dominant ones that NEC would capture, are they still useful and interpretable, or do they risk overfitting? NCC appears to strike a balance between interpretability and overfitting, but it is unclear how much the method benefits from this added flexibility. Would the same performance pat
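The reviews describe NCC only loosely: it is based on "concept contribution × weight", it is computed per image and class, and it has a threshold $\tau$. The paper's exact definition is not reproduced on this page, so the sketch below is one plausible reading and not the authors' formula. It counts the fewest concepts whose absolute contributions cover a fraction `tau` of the total for one image and class. The function names and the toy numbers are hypothetical. For contrast, a weight-only count in the spirit of NEC is included.

```python
import numpy as np

def contribution_count(concepts, weights, class_idx, tau=0.9):
    """Assumed NCC-like count: fewest concepts covering a tau share of |activation * weight|."""
    contrib = np.abs(concepts * weights[class_idx])
    total = contrib.sum()
    if total == 0:
        return 0  # no concept contributes to this class for this image
    # Sort contributions from largest to smallest and accumulate their share.
    cum = np.cumsum(np.sort(contrib)[::-1]) / total
    # First position where the cumulative share reaches tau (clamped for rounding).
    return int(min(np.searchsorted(cum, tau) + 1, contrib.size))

def nonzero_weights_per_class(weights):
    """Weight-only sparsity: number of nonzero weights in each class row."""
    return np.count_nonzero(weights, axis=1)

# Toy example: 4 concepts, 2 classes (all values hypothetical).
concepts = np.array([1.0, 0.5, 0.0, 0.1])   # concept activations for one image
weights = np.array([[2.0, 1.0, 5.0, 1.0],   # class 0
                    [0.0, 0.0, 1.0, 0.0]])  # class 1

count_90 = contribution_count(concepts, weights, class_idx=0, tau=0.9)
count_50 = contribution_count(concepts, weights, class_idx=0, tau=0.5)
per_class = nonzero_weights_per_class(weights)
```

In the toy example, class 0 has four nonzero weights, yet two concepts already account for over 90% of the contribution for this image, because the concept with the largest weight is inactive. This is the kind of gap between a weight-level and a decision-level count that the reviewers discuss.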
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Advanced Graph Neural Networks
