Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors
Peiyu Yang, Naveed Akhtar, Jiantong Jiang, Ajmal Mian

TL;DR
This paper introduces an attribution-guided model rectification framework using rank-one model editing to efficiently correct unreliable behaviors in neural networks, reducing data requirements and manual effort.
Contribution
It proposes a novel attribution-guided layer localization method and a formulation that corrects unreliabilities while preserving overall model performance.
Findings
Effective correction of neural Trojans, spurious correlations, and feature leakage.
Achieves model rectification with as few as one cleansed sample.
Reduces reliance on large-scale data cleaning and retraining.
Abstract
The performance of neural network models deteriorates when they behave unreliably on the non-robust features of corrupted samples. Because these models are opaque, rectifying them often requires arduous data cleaning and model retraining, which incurs substantial computational and manual overhead. In this work, we leverage rank-one model editing to establish an attribution-guided model rectification framework that effectively locates and corrects unreliable model behaviors. We first distinguish our rectification setting from existing model editing, yielding a formulation that corrects unreliable behavior while preserving model performance and reducing reliance on large budgets of cleansed samples. We further reveal a bottleneck in model rectification that arises from heterogeneous editability across layers. To target the primary source of misbehavior, we introduce an…
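For readers unfamiliar with rank-one model editing, the following is a minimal sketch of a generic rank-one update to a single linear layer. It is not the paper's rectification formulation, which the truncated abstract does not spell out. It assumes the common closed form W' = W + (v* - W k*)(C^{-1} k*)^T / (k*^T C^{-1} k*), where the names `rank_one_edit`, `k_star`, `v_star`, and `C` are illustrative: `k_star` is the layer input to be remapped, `v_star` is the desired output, and `C` is an estimate of the second moment of the layer's inputs.

```python
import numpy as np

def rank_one_edit(W, k_star, v_star, C=None):
    """Return W' = W + r u^T / (k*^T u), with u = C^{-1} k* and r = v* - W k*.

    The edited matrix maps k_star exactly to v_star, and the change W' - W
    has rank one. C is an assumed input second-moment matrix; identity
    is used when it is not given.
    """
    k_star = np.asarray(k_star, dtype=float)
    v_star = np.asarray(v_star, dtype=float)
    if C is None:
        C = np.eye(W.shape[1])
    u = np.linalg.solve(C, k_star)        # u = C^{-1} k*
    residual = v_star - W @ k_star        # what the layer currently gets wrong
    return W + np.outer(residual, u) / (k_star @ u)

# Toy data, purely illustrative
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))               # layer weights: 6 inputs, 4 outputs
keys = rng.normal(size=(100, 6))          # sample of typical layer inputs
C = keys.T @ keys / len(keys)             # empirical second moment of inputs
k_star = rng.normal(size=6)               # input whose output should change
v_star = rng.normal(size=4)               # desired output for k_star

W_new = rank_one_edit(W, k_star, v_star, C)
```

After the edit, `W_new @ k_star` equals `v_star` up to floating-point error, while weighting the update direction by the inverse of `C` keeps the change small on inputs that resemble the sampled `keys`. Which layer to edit is a separate question, which the paper addresses with its attribution-guided localization.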
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
