Backward Compatibility in Attributive Explanation and Enhanced Model Training Method
Ryuta Matsuno

TL;DR
This paper introduces BCX, a metric for evaluating explanation consistency after model updates, and BCXR, a training method to improve explanation backward compatibility while maintaining predictive accuracy.
Contribution
The paper proposes BCX as a new quantitative metric for explanation backward compatibility and BCXR as a training method to enhance this compatibility in model updates.
Findings
BCXR improves explanation consistency across models.
BCXR maintains high predictive performance.
BCXR outperforms baseline methods in experiments.
Abstract
Model updating is a crucial process in the operation of ML/AI systems. While updating a model generally improves average prediction performance, it also significantly affects the explanations of predictions. In real-world applications, even minor changes in explanations can have detrimental consequences. To tackle this issue, this paper introduces BCX, a quantitative metric that evaluates the backward compatibility of feature attribution explanations between pre- and post-update models. BCX uses practical agreement metrics to calculate the average agreement between the explanations of the pre- and post-update models, restricted to samples that both models predict correctly. In addition, we propose BCXR, a BCX-aware model training method based on surrogate losses that theoretically lower-bound the agreement scores. Furthermore, we present a universal variant of BCXR that…
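As a rough illustration of the BCX definition in the abstract, the following Python sketch averages a per-sample agreement score over the samples that both the pre- and post-update models predict correctly. The abstract does not say which agreement metrics are used, so top-k feature overlap is an assumption here, as are the names `topk_agreement` and `bcx`, the parameter `k`, the toy data, and the choice to return NaN when no sample is predicted correctly by both models.

```python
import numpy as np


def topk_agreement(attr_old, attr_new, k):
    """Fraction of the top-k features (by absolute attribution) shared by two explanations.

    Assumption: top-k overlap stands in for the paper's agreement metrics.
    """
    top_old = set(np.argsort(-np.abs(attr_old))[:k].tolist())
    top_new = set(np.argsort(-np.abs(attr_new))[:k].tolist())
    return len(top_old & top_new) / k


def bcx(y_true, pred_old, pred_new, attrs_old, attrs_new, k=3):
    """Average explanation agreement over samples both models predict correctly."""
    y_true = np.asarray(y_true)
    both_correct = (np.asarray(pred_old) == y_true) & (np.asarray(pred_new) == y_true)
    idx = np.flatnonzero(both_correct)
    if idx.size == 0:
        # Assumption: the score is undefined when no sample qualifies.
        return float("nan")
    scores = [topk_agreement(attrs_old[i], attrs_new[i], k) for i in idx]
    return float(np.mean(scores))


# Toy data (hypothetical): 4 samples, 4 features.
y_true = [0, 1, 1, 0]
pred_old = [0, 1, 0, 0]  # pre-update model is wrong on sample 2
pred_new = [0, 1, 1, 1]  # post-update model is wrong on sample 3
attrs_old = np.array([
    [0.9, 0.1, -0.5, 0.0],
    [0.2, 0.8, 0.1, -0.7],
    [0.3, 0.3, 0.3, 0.1],
    [0.5, 0.4, 0.0, 0.1],
])
attrs_new = np.array([
    [0.8, -0.6, 0.2, 0.1],
    [0.1, 0.9, 0.0, -0.6],
    [0.0, 0.1, 0.9, 0.2],
    [0.1, 0.2, 0.3, 0.4],
])

# Only samples 0 and 1 are correct under both models.
score = bcx(y_true, pred_old, pred_new, attrs_old, attrs_new, k=2)
print(score)  # 0.75: sample 0 shares 1 of 2 top features, sample 1 shares 2 of 2
```

Restricting the average to samples that both models get right follows the abstract's wording; a different agreement function (for example, rank correlation) could be swapped in for `topk_agreement` without changing the rest of the sketch.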
Taxonomy
Topics: Topic Modeling
