AtPatch: Debugging Transformers via Hot-Fixing Over-Attention
Shihao Weng, Yang Feng, Jincheng Li, Yining Yin, Xiaofei Xie, Jia Liu

TL;DR
AtPatch is a dynamic hot-fix method that detects and corrects anomalous attention patterns in transformer models during inference, effectively mitigating backdoor attacks and unfairness without retraining.
Contribution
It introduces a novel attention redistribution technique that operates on-the-fly during inference, improving mitigation of malicious and biased attention patterns in transformer models.
Findings
More effective at mitigating backdoor attacks.
Better preserves original model functionality.
Works without retraining or modifying model parameters.
Abstract
Transformer-based deep neural networks (DNNs) affected by backdoor attacks and unfairness typically exhibit anomalous attention patterns, leading them to over-attend to backdoor triggers or protected attributes. Existing neuron-editing mitigation strategies often struggle to handle such situations; most of them lack flexibility and tend to distort feature representations. Motivated by this over-attention phenomenon and by software engineering paradigms such as delta debugging and hot patching, we propose AtPatch, a hot-fix method that dynamically redistributes attention maps during model inference. Specifically, for a given input, AtPatch first extracts the attention map from the model's inference process. It then uses a pre-trained detector to identify anomalous columns and replaces them with unified benign attention. Next, AtPatch rescales other columns to mitigate the impact of…
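The detect, replace, and rescale steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `threshold_detector` and `patch_attention` are hypothetical, a simple mean-attention threshold stands in for the pre-trained detector, the "unified benign attention" is assumed to be the uniform value 1/n, and the rescaling is assumed to restore each row to sum 1. The abstract is truncated before it specifies the actual rescaling rule, so that part in particular is a guess.

```python
from typing import Callable, List

Matrix = List[List[float]]  # attn[i][j]: attention from query i to key j


def threshold_detector(attn: Matrix, threshold: float = 0.5) -> List[int]:
    """Hypothetical stand-in for the pre-trained detector.

    Flags key columns whose mean attention over all queries exceeds
    `threshold`. The real detector in the paper is a trained model.
    """
    n_rows, n_cols = len(attn), len(attn[0])
    means = [sum(row[j] for row in attn) / n_rows for j in range(n_cols)]
    return [j for j, m in enumerate(means) if m > threshold]


def patch_attention(
    attn: Matrix,
    detect: Callable[[Matrix], List[int]] = threshold_detector,
) -> Matrix:
    """Replace flagged columns with a uniform value and rescale the rest."""
    n_cols = len(attn[0])
    flagged = set(detect(attn))
    # Nothing to patch, or nothing left to rescale: return an unchanged copy.
    if not flagged or len(flagged) == n_cols:
        return [row[:] for row in attn]

    benign = 1.0 / n_cols                   # assumed "unified benign" value
    budget = 1.0 - benign * len(flagged)    # mass left for unflagged columns
    n_free = n_cols - len(flagged)

    patched = []
    for row in attn:
        rest = sum(v for j, v in enumerate(row) if j not in flagged)
        new_row = []
        for j, v in enumerate(row):
            if j in flagged:
                new_row.append(benign)
            elif rest > 0:
                # Keep relative proportions among unflagged columns.
                new_row.append(v * budget / rest)
            else:
                # Row attended only to flagged columns: spread evenly.
                new_row.append(budget / n_free)
        patched.append(new_row)
    return patched


# Toy attention map where every query over-attends to key 1.
attn = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.05, 0.90, 0.05],
]
patched = patch_attention(attn)
```

In this toy map, column 1 is flagged and reset to 1/3 in every row, and the remaining columns are scaled up so that each row remains a valid attention distribution. No model parameters are touched, which matches the hot-fix framing in the abstract.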
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Advanced Malware Detection Techniques
