Unlearning-based Neural Interpretations

Ching Lam Choi; Alexandre Duplessis; Serge Belongie

arXiv:2410.08069·cs.LG·February 12, 2025

Open Access·3 Reviews

TL;DR

This paper introduces UNI, an unlearning-based method for neural interpretation that creates adaptive baselines to improve the faithfulness and robustness of gradient-based attribution maps by erasing salient features and smoothing decision boundaries.

Contribution

It proposes a novel unlearning approach to generate reliable, debiased, and adaptive baselines for gradient-based interpretations, addressing limitations of static baseline methods.

Findings

01

UNI effectively erases salient features in attribution maps.

02

The method produces more faithful and robust explanations.

03

UNI smooths high-curvature decision boundaries for better interpretability.

Abstract

Gradient-based interpretations often require an anchor point of comparison to avoid saturation in computing feature importance. We show that current baselines defined using static functions--constant mapping, averaging or blurring--inject harmful colour, texture or frequency assumptions that deviate from model behaviour. This leads to accumulation of irregular gradients, resulting in attribution maps that are biased, fragile and manipulable. Departing from the static approach, we propose UNI to compute an (un)learnable, debiased and adaptive baseline by perturbing the input towards an unlearning direction of steepest ascent. Our method discovers reliable baselines and succeeds in erasing salient features, which in turn locally smooths the high-curvature decision boundaries. Our analyses point to unlearning as a promising avenue for generating faithful, efficient and robust…
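To make the role of the baseline concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the abstract only states that the baseline is obtained by perturbing the input towards an unlearning direction of steepest ascent, so everything below is an assumption for illustration. It uses integrated gradients as one well-known example of a baseline-dependent attribution method, a toy logistic model, and a hypothetical helper `ascent_baseline` that takes signed gradient-ascent steps on the loss with respect to the input inside a bounded region (the `step`, `n_steps` and `eps` parameters are invented for this sketch).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    """Probability of the positive class under a toy logistic model."""
    return sigmoid(np.dot(w, x) + b)

def input_grad_of_prob(w, b, x):
    """Gradient of the predicted probability with respect to the input."""
    p = predict(w, b, x)
    return p * (1.0 - p) * w

def ascent_baseline(w, b, x, label, step=0.1, n_steps=20, eps=1.0):
    """Hypothetical adaptive baseline (an assumption, not the paper's method):
    signed gradient ascent on the cross-entropy loss with respect to the
    input, clipped to an L-infinity ball of radius eps around x."""
    xb = x.copy()
    for _ in range(n_steps):
        p = predict(w, b, xb)
        grad_loss = (p - label) * w          # d(BCE)/dx for a logistic model
        xb = xb + step * np.sign(grad_loss)  # move so the loss increases
        xb = np.clip(xb, x - eps, x + eps)   # stay close to the input
    return xb

def integrated_gradients(w, b, x, baseline, n_points=200):
    """Integrated gradients along the straight path from baseline to x,
    approximated with a midpoint Riemann sum."""
    alphas = (np.arange(n_points) + 0.5) / n_points
    total = np.zeros_like(x)
    for a in alphas:
        total += input_grad_of_prob(w, b, baseline + a * (x - baseline))
    return (x - baseline) * total / n_points

# Toy example with made-up weights and input.
w = np.array([2.0, -1.0, 0.5])
b = 0.1
x = np.array([1.0, -0.5, 0.2])

baseline = ascent_baseline(w, b, x, label=1.0)
attributions = integrated_gradients(w, b, x, baseline)
```

In this toy setting the baseline lowers the model's confidence in the original label, and the attributions sum approximately to the difference between the prediction at the input and at the baseline (the completeness property of integrated gradients). A static baseline such as an all-zero vector could be passed to `integrated_gradients` in the same way for comparison.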

Peer Reviews

Decision·ICLR 2025 Oral

Reviewer 01·Rating 8·Confidence 4

Strengths

The authors summarize previous research well and clearly explain the properties required of the reference. The intuition is clearly written and well supported by experimental results.

Weaknesses

1. Although the authors present their intuition effectively, there is no theoretical justification for their method. In particular, the pseudo-code is not explained in enough detail, and it is unclear why the authors chose to implement it in the way described in the code. The method itself needs further elaboration.

2. The suggested method is limited to explanation methods that require a reference, which restricts the broader applicability of the paper.

Reviewer 02·Rating 10·Confidence 4

Strengths

The paper offers a well-structured analysis of an important problem in neural network interpretability. The authors identify a specific, previously overlooked issue: static baseline approaches in attribution methods can introduce unintended biases in three distinct categories: color (shown through experiments with brightness/saturation changes), texture (demonstrated via Gaussian/defocus blur tests), and frequency (validated through Gaussian/shot noise experiments). This observation is me

Weaknesses

Largely this is a well-written paper; here are a few potential avenues for improvement:

- Would it be possible to make stronger theoretical guarantees about the optimality of the unlearned baseline?

- Image classification baselines are pretty standard; I wonder what these baselines would look like in other domains such as NLP and audio.

- There could be more discussion of hyperparameters, an often overlooked aspect in the explainability literature.

- It is mentioned that other techniques are

Reviewer 03·Rating 8·Confidence 4

Strengths

1. **Innovative Baseline Approach**: The idea of using unlearning to create adaptive baselines is novel and addresses a known limitation in static baseline approaches.

2. **Robustness and Faithfulness**: The empirical results, including MuFidelity scores and robustness to adversarial perturbations, highlight the potential of the method to produce more reliable attributions.

3. **Comprehensive Experiments**: The paper includes evaluations on multiple models and datasets, adding credibility to the

Weaknesses

However, I spotted some problems, some major (**M**) and some minor (**m**):

**M1.** Literature Gaps: the related work section omits significant prior research on black-box interpretability (RISE, Sobol...) and misses some really important faithfulness metrics. A relevant paper from last year (Saliency strike back) could also be discussed.

**M2.** Faithfulness Metrics: while the paper discusses MuFidelity as a metric, it does not consider complementary metrics such as deletion and inser

Taxonomy

Topics·Neural Networks and Applications