CAT: Concept-level backdoor ATtacks for Concept Bottleneck Models

Songning Lai; Jiayu Yang; Yu Huang; Lijie Hu; Tianlang Xue; Zhangyi Hu; Jiaxu Li; Haicheng Liao; Yutao Yue

arXiv:2410.04823·cs.CV·August 19, 2025

TL;DR

This paper introduces CAT and CAT+, novel concept-level backdoor attack methods for Concept Bottleneck Models, revealing security vulnerabilities and providing evaluation tools to measure attack effectiveness and stealthiness.

Contribution

The paper presents the first concept-level backdoor attack framework for CBMs, including an enhanced version (CAT+) with optimized trigger selection, and offers a comprehensive evaluation methodology.
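The page does not reproduce the correlation function that CAT+ uses to choose trigger concepts. As a rough illustration only, the sketch below assumes binary concept annotations, scores each concept by its Pearson correlation with membership in the target class, and keeps the k highest-scoring concepts as trigger candidates. The function names and the ranking direction (highest correlation first) are hypothetical choices, not the paper's algorithm.

```python
from math import sqrt
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_trigger_concepts(concepts: List[List[int]], labels: List[int],
                            target: int, k: int) -> List[int]:
    """Hypothetical CAT+-style selection: rank each concept column by its
    correlation with the target-class indicator and return the top-k indices."""
    is_target = [1.0 if y == target else 0.0 for y in labels]
    scores = []
    for j in range(len(concepts[0])):
        column = [float(row[j]) for row in concepts]
        scores.append((pearson(column, is_target), j))
    # Highest correlation first; ties broken by the lower concept index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


if __name__ == "__main__":
    # Toy data: 4 samples x 4 binary concepts, target class 1.
    X = [[1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0], [0, 1, 0, 1]]
    y = [1, 1, 0, 0]
    print(select_trigger_concepts(X, y, target=1, k=2))  # [0, 2]
```

A real attack would also have to weigh stealthiness (how unusual the triggered concept vector looks), which this single score ignores.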

Findings

01. High attack success rates with minimal impact on clean data
02. Effective and stealthy concept trigger selection via a correlation function
03. Demonstrated security risks in CBMs and the need for robust defenses

Abstract

Although deep learning has had a transformative impact across multiple domains, the inherent opacity of these models has driven the development of Explainable Artificial Intelligence (XAI). Among these efforts, Concept Bottleneck Models (CBMs) have emerged as a key approach to improving interpretability by leveraging high-level semantic information. However, CBMs, like other machine learning models, are susceptible to security threats, particularly backdoor attacks, which can covertly manipulate model behavior. Because the community has not yet studied concept-level backdoor attacks on CBMs, and in the spirit of "Better the devil you know than the devil you don't know," we introduce CAT (Concept-level Backdoor ATtacks), a methodology that leverages the conceptual representations within CBMs to embed triggers during training, enabling controlled manipulation of model predictions at…
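To make "embedding triggers during training" concrete, here is a minimal sketch of concept-level data poisoning under stated assumptions: concept annotations are binary vectors, the trigger is a hypothetical mapping from concept index to a fixed value, and the injection rate is the fraction of the whole training set that is poisoned, drawn only from non-target samples. Names such as `poison_dataset` and `apply_trigger` are illustrative, not taken from the paper.

```python
import random
from typing import Dict, List, Tuple


def apply_trigger(concepts: List[int], trigger: Dict[int, int]) -> List[int]:
    """Return a copy of a concept vector with the trigger slots overwritten."""
    poisoned = list(concepts)
    for idx, value in trigger.items():
        poisoned[idx] = value
    return poisoned


def poison_dataset(concepts: List[List[int]], labels: List[int],
                   trigger: Dict[int, int], target: int,
                   injection_rate: float, seed: int = 0
                   ) -> Tuple[List[List[int]], List[int], List[int]]:
    """Hypothetical concept-level poisoning: apply the trigger to a random
    subset of non-target samples and relabel them to the target class.
    Returns (poisoned_concepts, poisoned_labels, poisoned_indices)."""
    if not 0.0 <= injection_rate <= 1.0:
        raise ValueError("injection_rate must be in [0, 1]")
    rng = random.Random(seed)
    candidates = [i for i, y in enumerate(labels) if y != target]
    # Injection rate is measured against the full training set (assumption).
    n_poison = min(int(round(injection_rate * len(labels))), len(candidates))
    chosen = sorted(rng.sample(candidates, n_poison))
    new_concepts = [list(c) for c in concepts]  # leave the inputs untouched
    new_labels = list(labels)
    for i in chosen:
        new_concepts[i] = apply_trigger(concepts[i], trigger)
        new_labels[i] = target
    return new_concepts, new_labels, chosen


if __name__ == "__main__":
    X = [[0, 0, 0], [0, 1, 0], [1, 0, 0], [1, 1, 1]]
    y = [0, 0, 1, 2]
    Xp, yp, idx = poison_dataset(X, y, {0: 1, 2: 1}, target=2,
                                 injection_rate=0.5)
    print(idx, yp)
```

The poisoned concept vectors and relabeled targets would then be used to train the concept-to-label stage; the trigger size (number of overwritten slots) and the injection rate are the two knobs this sketch exposes.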

Peer Reviews

Decision: Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

- This paper makes a pioneering effort in investigating backdoor threats against CBMs.
- It provides a well-rounded analysis of the attack, encompassing both empirical evidence and theoretical insights.

Weaknesses

### 1. Writing Quality

The authors' attempts to make the paper visually engaging, such as including a cute icon and a notable saying at the start, are appreciated. However, while these elements add charm to the introduction, the main body lacks the same level of engagement. I would encourage the authors to focus more on enriching the scientific content, rather than on decorative elements.

- **Clarity of Positioning**: My primary concern is that the paper’s positioning is unclear. When reading…

Reviewer 02 · Rating 3 · Confidence 4

Strengths

The paper identifies a previously unexplored backdoor attack in CBMs. The experiments cover multiple datasets, parameters (trigger size, injection rate), and different target classes.

Weaknesses

A CBM essentially consists of two parts. The first is an encoder that converts raw data into concepts, and the second is a linear layer that maps these concepts to the final category. Instead of working directly on the inputs, this attack operates on the converted concepts. During the attack, a trigger function applies a predefined static trigger to the concepts, causing the sample to be misclassified into a specific target class. From my perspective, the second part of this process…
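The two-stage structure described above can be sketched as follows. This is a toy illustration, not the paper's model: the concept encoder (first stage) is omitted, so the example starts from a concept vector, and the weights of the linear head are hand-picked to mimic a head that has learned to associate hypothetical trigger concepts 2 and 3 with target class 2.

```python
from typing import Dict, List, Sequence


def linear_head(concepts: Sequence[float], weights: List[List[float]],
                bias: List[float]) -> List[float]:
    """Second CBM stage: class logits = W @ c + b, one row of W per class."""
    return [sum(w * c for w, c in zip(row, concepts)) + b
            for row, b in zip(weights, bias)]


def predict(concepts: Sequence[float], weights: List[List[float]],
            bias: List[float]) -> int:
    """Return the index of the largest logit."""
    logits = linear_head(concepts, weights, bias)
    return max(range(len(logits)), key=logits.__getitem__)


def trigger_fn(concepts: Sequence[float],
               trigger: Dict[int, float]) -> List[float]:
    """Static concept-level trigger: overwrite fixed concept slots."""
    out = list(concepts)
    for idx, value in trigger.items():
        out[idx] = value
    return out


# Hypothetical head over 4 concepts and 3 classes; class 2 is the target
# and relies on concepts 2 and 3, which the trigger switches on.
W = [[2.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 1.5, 1.5]]
B = [0.0, 0.0, 0.0]
TRIGGER = {2: 1.0, 3: 1.0}

if __name__ == "__main__":
    clean = [1.0, 0.0, 0.0, 0.0]
    print(predict(clean, W, B))                        # 0
    print(predict(trigger_fn(clean, TRIGGER), W, B))   # 2
```

Because the head is linear, the trigger's effect is additive in the logits: switching on concepts 2 and 3 adds 3.0 to the target logit, which outweighs the clean evidence for class 0.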

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper explores the vulnerability of concept bottleneck models by utilizing conceptual information, whose representations with triggers are not easily detectable.
2. The presentation of the paper is clear and easy to follow.
3. Some evaluation analyses are well-written.

Weaknesses

1. The paper validates its proposed attack on a limited set of datasets: only two (mostly the CUB dataset), which greatly weakens the evidence for the attack's effectiveness. As shown in Table 1 and the accompanying explanations, the attack success rate is highly related to the concept space. The authors should perform more evaluations on different datasets (e.g., the CelebA dataset) to better understand the attack. Furthermore, the proposed CAT+ definitely needs more effort t…
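The attack success rate cited from Table 1 is not defined on this page. Assuming the common definitions from the backdoor literature, clean accuracy is measured on untouched test concept vectors, and attack success rate is the fraction of non-target test samples classified as the target once the trigger is applied. The sketch below accepts any prediction callable; `toy_predict` is a hypothetical stand-in model, not a trained CBM.

```python
from typing import Callable, Dict, List, Sequence

Predictor = Callable[[Sequence[float]], int]


def clean_accuracy(predict: Predictor, concepts: List[List[float]],
                   labels: List[int]) -> float:
    """Accuracy on unmodified (clean) test concept vectors."""
    correct = sum(1 for c, y in zip(concepts, labels) if predict(c) == y)
    return correct / len(labels)


def attack_success_rate(predict: Predictor, concepts: List[List[float]],
                        labels: List[int], trigger: Dict[int, float],
                        target: int) -> float:
    """Fraction of non-target test samples predicted as `target` after the
    trigger overwrites their concept slots."""
    hits = total = 0
    for c, y in zip(concepts, labels):
        if y == target:
            continue  # already in the target class; excluded by assumption
        triggered = list(c)
        for idx, value in trigger.items():
            triggered[idx] = value
        total += 1
        hits += int(predict(triggered) == target)
    return hits / total if total else 0.0


def toy_predict(c: Sequence[float]) -> int:
    """Hypothetical model: concept 2 forces class 2; else concept 0 -> 0, else 1."""
    if c[2] == 1:
        return 2
    return 0 if c[0] == 1 else 1


if __name__ == "__main__":
    X = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    y = [0, 1, 2]
    print(clean_accuracy(toy_predict, X, y))                      # 1.0
    print(attack_success_rate(toy_predict, X, y, {2: 1}, target=2))  # 1.0
```

Reporting both numbers together captures the trade-off the Findings describe: a successful backdoor keeps clean accuracy close to the unpoisoned model while driving the attack success rate up.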
