Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding
Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Dong Gu Lee, Wonseok Jeong, Sang Woo Kim

TL;DR
This paper enhances Vision Transformer robustness by introducing PreLayerNorm in patch embedding, addressing scale-invariance issues, and demonstrating improved performance across various image corruptions, especially contrast variations.
Contribution
The paper proposes a novel PreLayerNorm patch embedding method for ViT, improving its robustness to scale changes and contrast variations compared to standard ViT.
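As a rough illustration of the idea, the sketch below compares a standard patch embedding with one that applies layer normalization to each flattened patch before the linear projection. This is a minimal NumPy sketch under assumptions, not the authors' implementation: the function names (`layer_norm`, `patchify`, `patch_embed`), the toy sizes, the omission of learnable gain and bias, and the modeling of a contrast change as a plain multiplication of pixel values are all hypothetical choices made for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each patch vector (last axis) to zero mean and unit variance.
    # Learnable gain/bias are omitted here for brevity (an assumption).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def patchify(image, patch):
    # Split an (H, W, C) image into flattened, non-overlapping patches
    # of shape (num_patches, patch * patch * C).
    h, w, c = image.shape
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * c)

def patch_embed(image, proj, pos, patch, pre_ln=False):
    # Standard embedding: patches @ proj + pos.
    # With pre_ln=True, each patch is layer-normalized before the projection.
    x = patchify(image, patch)
    if pre_ln:
        x = layer_norm(x)
    return x @ proj + pos

rng = np.random.default_rng(0)
patch, dim = 4, 8
image = rng.uniform(0.1, 1.0, size=(16, 16, 3))   # toy image, 16 patches
proj = rng.normal(size=(patch * patch * 3, dim))  # toy projection matrix
pos = rng.normal(size=(16, dim))                  # toy positional embedding

scaled = 2.0 * image  # crude stand-in for a change in color scale

# Largest change in the embedding caused by rescaling the input.
std_shift = np.abs(patch_embed(scaled, proj, pos, patch)
                   - patch_embed(image, proj, pos, patch)).max()
pre_shift = np.abs(patch_embed(scaled, proj, pos, patch, pre_ln=True)
                   - patch_embed(image, proj, pos, patch, pre_ln=True)).max()

print("standard embedding shift:", std_shift)
print("pre-LN embedding shift:  ", pre_shift)
```

In this toy setting the standard embedding moves substantially when the input is rescaled, while the pre-normalized variant changes only by a small amount attributable to the `eps` term in the normalization.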
Findings
ViT with PreLayerNorm outperforms standard ViT in robustness tests.
PreLayerNorm mitigates performance degradation caused by contrast enhancement.
ViT with PreLayerNorm maintains higher accuracy across various image corruptions.
Abstract
Vision transformers (ViTs) have recently demonstrated state-of-the-art performance in a variety of vision tasks, replacing convolutional neural networks (CNNs). Because ViT has a different architecture from CNN, it may also behave differently. To investigate the reliability of ViT, this paper studies its behavior and robustness. We compared the robustness of CNN and ViT under various image corruptions that may appear in practical vision tasks. We confirmed that for most image transformations, ViT showed robustness comparable to or better than that of CNN. However, for contrast enhancement, severe performance degradation was consistently observed in ViT. From a detailed analysis, we identified a potential problem: positional embedding in ViT's patch embedding could work improperly when the color scale changes. Here we claim the use of PreLayerNorm, a modified patch embedding…
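The color-scale argument can be written out in a few lines. The notation here is an assumption made for illustration and is not taken from the paper: x_p is a flattened patch, E the projection matrix, E_pos the positional embedding, LN layer normalization without learnable parameters, and a change in color scale is modeled as multiplying x_p by a factor α > 0 (ignoring the small ε inside LN).

```latex
\begin{aligned}
\text{Standard:}\quad
  & z = x_p E + E_{pos},
  && x_p \mapsto \alpha x_p \;\Rightarrow\; z' = \alpha\, x_p E + E_{pos} \\
\text{PreLayerNorm:}\quad
  & z = \mathrm{LN}(x_p)\, E + E_{pos},
  && \mathrm{LN}(\alpha x_p) = \mathrm{LN}(x_p) \;\Rightarrow\; z' = z
\end{aligned}
```

In the standard form, only the content term is rescaled, so its magnitude relative to the fixed positional term shifts with α; normalizing the patch first removes that dependence.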
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Advanced Memory and Neural Computing
Methods: Layer Normalization · Vision Transformer
