Curved Representation Space of Vision Transformers
Juyeop Kim, Junha Park, Songkuk Kim, Jong-Seok Lee

TL;DR
This paper studies how the penultimate-layer output of Vision Transformers moves as the input moves linearly, and shows that curved trajectories in representation space account for Transformers being more robust than CNNs without being overconfident.
Contribution
It provides an empirical analysis of the nonlinear, curved trajectories in Transformer representations and links them to robustness and prediction confidence.
Findings
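For some inputs, Transformer representations follow curved trajectories as the input moves linearly, whereas CNNs show a fairly linear relationship between input and output movements.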
Curved regions in the space hinder movement out of decision regions, enhancing robustness.
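Once an input is nudged out of a curved region, its representation moves linearly and reaches the decision boundary directly, so a boundary does lie nearby, which is consistent with the underconfident predictions.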
Abstract
Neural networks with self-attention (a.k.a. Transformers) such as ViT and Swin have emerged as a better alternative to traditional convolutional neural networks (CNNs). However, our understanding of how the new architecture works is still limited. In this paper, we focus on the phenomenon that Transformers show higher robustness against corruptions than CNNs, while not being overconfident. This is contrary to the intuition that robustness increases with confidence. We resolve this contradiction by empirically investigating how the output of the penultimate layer moves in the representation space as the input data moves linearly within a small area. In particular, we show the following. (1) While CNNs exhibit a fairly linear relationship between the input and output movements, Transformers show a nonlinear relationship for some data. For those data, the output of Transformers moves in a curved…
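The probing idea described in the abstract (move the input along a straight line and watch how the penultimate-layer output moves) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `trajectory_turning_angles`, `linear_features`, and `curved_features` are hypothetical names, the two toy feature maps merely stand in for a network's penultimate layer, and the turning angle between consecutive output displacements is one simple way to quantify curvature, not necessarily the measure used in the paper.

```python
import numpy as np

def trajectory_turning_angles(feature_fn, x, direction, step=0.01, n_steps=20):
    """Move x linearly along `direction` and return the angles (radians)
    between consecutive displacement vectors of feature_fn's output.
    Assumes consecutive outputs differ (nonzero displacements)."""
    d = direction / np.linalg.norm(direction)
    # Features at x, x + step*d, ..., x + n_steps*step*d
    feats = np.stack([feature_fn(x + k * step * d) for k in range(n_steps + 1)])
    deltas = np.diff(feats, axis=0)            # output movement per input step
    norms = np.linalg.norm(deltas, axis=1)
    cos = np.sum(deltas[:-1] * deltas[1:], axis=1) / (norms[:-1] * norms[1:])
    return np.arccos(np.clip(cos, -1.0, 1.0))  # 0 means a perfectly straight path

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))  # hypothetical projection, not a trained model

def linear_features(x):
    # Toy stand-in for a model whose output moves linearly with the input
    return W @ x

def curved_features(x):
    # Toy stand-in for a model whose output bends as the input moves
    return np.sin(5.0 * (W @ x))

x0 = rng.normal(size=16)
direction = rng.normal(size=16)

linear_angles = trajectory_turning_angles(linear_features, x0, direction)
curved_angles = trajectory_turning_angles(curved_features, x0, direction)

print("mean turning angle (linear toy):", linear_angles.mean())
print("mean turning angle (curved toy):", curved_angles.mean())
```

With a real model, `feature_fn` would wrap a forward pass that returns the penultimate-layer activations, and `direction` could be, for example, a gradient-based direction; both choices are left open here.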
Taxonomy
Topics: Neural Networks and Applications · Advanced Neural Network Applications · Adversarial Robustness in Machine Learning
