Rectify ViT Shortcut Learning by Visual Saliency

Chong Ma; Lin Zhao; Yuzhong Chen; David Weizhong Liu; Xi Jiang; Tuo Zhang; Xintao Hu; Dinggang Shen; Dajiang Zhu; Tianming Liu

arXiv:2206.08567·cs.CV·June 20, 2022

Open Access

TL;DR

This paper introduces a saliency-guided vision transformer that rectifies shortcut learning by focusing on informative image regions using computational saliency, improving model interpretability and performance without needing eye-gaze data.

Contribution

The proposed SGT model leverages computational visual saliency to guide ViT in avoiding shortcut learning, eliminating the need for labor-intensive eye-gaze data.
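
One way to picture "guiding" a ViT with a saliency map is to score each image patch by its mean saliency and pass only the highest-scoring patches on as tokens. The sketch below illustrates that idea on a toy map; it is not taken from the paper. The mean-saliency scoring rule, the top-k selection, and the names `patch_saliency_scores` and `select_salient_patches` are all assumptions made for illustration, and the saliency map is assumed to come from some external computational saliency model.

```python
# Hypothetical sketch: rank ViT patches by mean saliency and keep the top-k.
# The scoring rule and all names here are illustrative assumptions,
# not the SGT authors' implementation.

def patch_saliency_scores(saliency, patch_size):
    """Mean saliency of each non-overlapping patch, in row-major patch order."""
    height, width = len(saliency), len(saliency[0])
    if height % patch_size or width % patch_size:
        raise ValueError("map dimensions must be divisible by patch_size")
    scores = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            total = sum(
                saliency[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            )
            scores.append(total / (patch_size * patch_size))
    return scores


def select_salient_patches(saliency, patch_size, keep):
    """Indices of the `keep` most salient patches, in ascending index order."""
    scores = patch_saliency_scores(saliency, patch_size)
    # sorted() is stable, so tied patches keep their row-major order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy 4x4 saliency map split into four 2x2 patches (values are made up).
toy_map = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.9, 0.0, 0.1],
    [0.0, 0.1, 0.2, 0.1],
    [0.1, 0.0, 0.6, 0.7],
]
toy_scores = patch_saliency_scores(toy_map, patch_size=2)
kept = select_salient_patches(toy_map, patch_size=2, keep=2)
```

In this toy case the top-left and bottom-right patches carry most of the saliency, so those two are the ones that would be kept as tokens. A real pipeline would apply the same index selection to the patch embeddings rather than to the raw map.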

Findings

1. Outperforms baseline models on four datasets
2. Effectively rectifies shortcut learning in ViT
3. Enhances interpretability of the model

Abstract

Shortcut learning is common but harmful to deep learning models, leading to degenerated feature representations and consequently jeopardizing the model's generalizability and interpretability. However, shortcut learning in the widely used Vision Transformer framework is largely unknown. Meanwhile, introducing domain-specific knowledge is a major approach to rectifying the shortcuts, which are dominated by background-related factors. For example, in the medical imaging field, eye-gaze data from radiologists is an effective form of human visual prior knowledge that has great potential to guide deep learning models to focus on meaningful foreground regions of interest. However, obtaining eye-gaze data is time-consuming, labor-intensive, and sometimes impractical. In this work, we propose a novel and effective saliency-guided vision transformer (SGT) model to rectify shortcut…

Taxonomy

Topics: Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Linear Layer · Dropout · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Label Smoothing · Absolute Position Encodings · Multi-Head Attention · Adam · Layer Normalization