Deep Concept Removal

Yegor Klochkov, Jean-Francois Ton, Ruocheng Guo, Yang Liu, and Hang Li

arXiv:2310.05755 · cs.LG · October 10, 2023

TL;DR

This paper introduces a novel adversarial method for removing specific concepts from deep neural network representations, enhancing fairness and out-of-distribution robustness.

Contribution

It proposes a new approach using adversarial linear classifiers and implicit gradient techniques for effective concept removal in deep networks.

Findings

1. Successful removal of targeted concepts in DRO benchmarks
2. Improved out-of-distribution generalization
3. Effective handling of concept entanglement

Abstract

We address the problem of concept removal in deep neural networks, aiming to learn representations that do not encode certain specified concepts (e.g., gender). We propose a novel method based on adversarial linear classifiers trained on a concept dataset, which helps to remove the targeted attribute while maintaining model performance. Our approach, Deep Concept Removal, incorporates adversarial probing classifiers at various layers of the network, effectively addressing concept entanglement and improving out-of-distribution generalization. We also introduce an implicit gradient-based technique to tackle the challenges associated with adversarial training using linear classifiers. We evaluate the ability to remove a concept on a set of popular distributionally robust optimization (DRO) benchmarks with spurious correlations, as well as on out-of-distribution (OOD) generalization tasks.
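The abstract does not spell out the objective, so the following is only a minimal NumPy sketch of how an adversarial linear probe with an implicit gradient could work, under stated assumptions. The probe is assumed to be an L2-regularized logistic regression without bias, fitted on concept-dataset activations `Z` with labels `y` in {-1, +1}. The penalty is assumed to be the squared norm of the optimal probe weights v* (Reviewer 01 describes the adversarial loss as the norm of v*). The gradient with respect to `Z` comes from the implicit function theorem rather than from differentiating through the solver's iterations. The names `fit_probe`, `concept_penalty`, and `concept_penalty_grad` are hypothetical, and this is not the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fit_probe(Z, y, lam, iters=50, tol=1e-12):
    """Hypothetical helper: L2-regularized logistic probe (no bias) fitted by Newton's method.

    Minimizes L(v) = mean_i log(1 + exp(-y_i * v.z_i)) + lam/2 * ||v||^2,
    with Z of shape (n, d) and y in {-1, +1}.
    """
    n, d = Z.shape
    v = np.zeros(d)
    for _ in range(iters):
        s = sigmoid(-y * (Z @ v))                # s_i = sigma(-y_i v.z_i)
        grad = -(Z.T @ (y * s)) / n + lam * v    # dL/dv
        w = s * (1.0 - s)
        H = (Z.T * w) @ Z / n + lam * np.eye(d)  # d^2L/dv^2, positive definite
        step = np.linalg.solve(H, grad)
        v = v - step
        if np.linalg.norm(step) < tol:
            break
    return v


def concept_penalty(Z, y, lam):
    """Assumed penalty: squared norm of the optimal probe weights v*(Z)."""
    v = fit_probe(Z, y, lam)
    return float(v @ v)


def concept_penalty_grad(Z, y, lam):
    """Gradient of ||v*(Z)||^2 w.r.t. Z via the implicit function theorem.

    At the optimum dL/dv(v*, Z) = 0, so dv*/dZ = -H^{-1} d(dL/dv)/dZ.
    With u = H^{-1} (2 v*), row i of the gradient is
    (y_i s_i u - s_i (1 - s_i) (u.z_i) v*) / n.
    """
    n, d = Z.shape
    v = fit_probe(Z, y, lam)
    s = sigmoid(-y * (Z @ v))
    w = s * (1.0 - s)
    H = (Z.T * w) @ Z / n + lam * np.eye(d)
    u = np.linalg.solve(H, 2.0 * v)
    return (np.outer(y * s, u) - np.outer(w * (Z @ u), v)) / n


# Usage on synthetic "activations" where the concept is the sign of coordinate 0.
rng = np.random.default_rng(0)
Z = rng.normal(size=(40, 3))
y = np.where(Z[:, 0] > 0, 1.0, -1.0)
G = concept_penalty_grad(Z, y, lam=0.5)
Z_step = Z - 1e-2 * G  # small step against the gradient of the penalty
print(concept_penalty(Z, y, 0.5), concept_penalty(Z_step, y, 0.5))
```

For a small enough step, the second printed value should be lower than the first, since the update moves the activations against the gradient of ‖v*‖², making the concept harder to separate linearly. In a network, `Z` would presumably be one layer's activations on the concept dataset, with the gradient backpropagated into the weights and, per the abstract, probes attached at several layers; how those penalties are weighted against the task loss is not specified here.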

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

1. The proposed setting is novel and interesting. Concept removal is proposed as an intuitive extension to [1].
2. The paper performs a comprehensive literature review and explains the relevant works needed to understand the paper in detail.
3. The research questions proposed by the work are insightful for understanding how concepts are embedded in the network architecture. The finding that adversarial CAV helps most when applied to layers before contraction is important.
4. The appendix provid

Weaknesses

1. At a high level, the paper seems to mainly combine two earlier works (Elazar et al., 2018 and Kim et al., TCAV, 2018), bringing the latter's idea of a concept into the former's framework. The novelty seems to be the choice of the adversarial loss, which is the norm of v*; however, this choice is not well motivated.
2. The work seems better suited to bias removal than to OOD generalization.
3. Quantitative baselines that would demonstrate the effectiveness of the proposed method are absent.

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

- The background on concept activation vectors and adversarial concept removal is easy to follow for readers not familiar with the topics.
- The approach is simple and decouples model training and concept erasure, so one can erase concepts in a post-hoc manner using a concept dataset instead of requiring training datapoints to have concept labels.
- The experiments in S4 that study the connection between layers and concept removal effectiveness are insightful.

Weaknesses

- The main experiment in S4 only demonstrates RQ1 and RQ2 on MNIST. The MNIST dataset is a good sanity check or starting point, but it is not enough to properly demonstrate the usefulness of this approach. A simple linear model can give 90% accuracy on the MNIST task and it is not representative of modern computer vision tasks. The experiments would be more convincing if the results hold on (a) “harder” tasks such as CIFAR-10 or CIFAR-100 and larger models (e.g., ResNet50 and ViTs)
- The experim

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

+ Originality: this paper proposes using adversarial linear classifiers to gain out-of-distribution robustness in the problem of concept removal.

Weaknesses

- Lack of comparison with other concept removal baseline methods in Table 2.
- Lack of comprehensive results in Section 6: the paper does not show the advantage of the method on the practical celebrity dataset.
- Lack of an ablation study on the training loss components, for example, the effect of the penalty term in Eq. 3.2 and of different values of λ in Eq. 3.1.
