Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding

Bum Jun Kim; Hyeyeon Choi; Hyeonah Jang; Dong Gu Lee; Wonseok Jeong; Sang Woo Kim

arXiv:2111.08413·cs.CV·November 17, 2021

Open Access

TL;DR

This paper improves the robustness of Vision Transformer (ViT) by adding PreLayerNorm to the patch embedding, which makes the model's behavior scale-invariant, and reports improved performance across various image corruptions, especially contrast variations.

Contribution

The paper proposes a novel PreLayerNorm patch embedding method for ViT, improving its robustness to scale changes and contrast variations compared to standard ViT.

Findings

01

ViT with PreLayerNorm outperforms standard ViT in robustness tests.

02

PreLayerNorm mitigates performance degradation caused by contrast enhancement.

03

ViT with PreLayerNorm maintains higher accuracy across various image corruptions.

Abstract

Vision transformers (ViTs) have recently demonstrated state-of-the-art performance in a variety of vision tasks, replacing convolutional neural networks (CNNs). Because ViT has a different architecture than CNN, it may behave differently. To investigate the reliability of ViT, this paper studies its behavior and robustness. We compared the robustness of CNN and ViT under various image corruptions that may appear in practical vision tasks. We confirmed that for most image transformations, ViT showed robustness comparable to or better than that of CNN. However, for contrast enhancement, severe performance degradation was consistently observed in ViT. From a detailed analysis, we identified a potential problem: positional embedding in ViT's patch embedding could work improperly when the color scale changes. Here we propose the use of PreLayerNorm, a modified patch embedding…
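The scale problem described in the abstract can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions, not the authors' implementation: it assumes that PreLayerNorm normalizes the linearly projected patches before the positional embedding is added, and it models a color-scale change as a plain multiplication of the input by a constant. The function names, tensor sizes, and random weights are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each token (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def standard_embed(patches, proj, pos):
    """Standard ViT patch embedding: linear projection plus positional embedding."""
    return patches @ proj + pos


def pre_layernorm_embed(patches, proj, pos):
    """Assumed PreLayerNorm variant: normalize the projection, then add positions."""
    return layer_norm(patches @ proj) + pos


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
num_patches, patch_dim, embed_dim = 4, 12, 8
patches = rng.normal(size=(num_patches, patch_dim))
proj = rng.normal(size=(patch_dim, embed_dim))
pos = rng.normal(size=(num_patches, embed_dim))

scale = 3.0  # assumed stand-in for a color-scale change

# Standard embedding: compare the tokens after the LayerNorm that the first
# transformer block would apply. Scaling the input changes the balance between
# the patch content and the positional embedding, so the tokens differ.
std_gap = np.abs(
    layer_norm(standard_embed(scale * patches, proj, pos))
    - layer_norm(standard_embed(patches, proj, pos))
).max()

# PreLayerNorm embedding: the input scale is normalized away before the
# positional embedding is added, so the tokens match up to the eps term.
pre_gap = np.abs(
    pre_layernorm_embed(scale * patches, proj, pos)
    - pre_layernorm_embed(patches, proj, pos)
).max()

print("standard embedding gap:", std_gap)
print("PreLayerNorm embedding gap:", pre_gap)
```

In this sketch the standard embedding yields visibly different tokens for the scaled input, while the PreLayerNorm variant yields nearly identical ones, which is the scale-invariant behavior the abstract refers to.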

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Advanced Memory and Neural Computing

Methods: Layer Normalization · Vision Transformer