SwinStyleformer is a favorable choice for image inversion
Jiawei Mao, Guangyi Zhao, Xuesong Yin, Yuanqi Chang

TL;DR
SwinStyleformer is a pure Transformer-based image inversion network that captures both local detail and global structure. It is reported to outperform CNN-based inversion methods by handling the long-range dependencies that those methods struggle with.
Contribution
The paper presents SwinStyleformer, a novel Transformer-based inversion network with multi-scale connections and learnable query blocks, achieving state-of-the-art results in image inversion.
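As a rough illustration of what a learnable query can look like, the sketch below implements generic single-head cross-attention in which the queries are free parameters rather than projections of the input. This is not taken from the paper: the function name `learnable_query_attention`, the shapes, and the choice of 18 queries of width 512 (mirroring the commonly used StyleGAN W+ layout) are assumptions made for illustration, and the summary does not say how SwinStyleformer wires its query blocks.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def learnable_query_attention(features, queries, w_k, w_v):
    """Cross-attention with parameter queries (hypothetical helper).

    features: (T, C) image tokens from the backbone
    queries:  (Q, D) learnable parameters, independent of the input
    w_k, w_v: (C, D) key and value projections
    Returns the attended output (Q, D) and the attention weights (Q, T).
    """
    keys = features @ w_k                      # (T, D)
    values = features @ w_v                    # (T, D)
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])  # (Q, T)
    weights = softmax(scores, axis=-1)         # each query attends over all tokens
    return weights @ values, weights

rng = np.random.default_rng(0)
num_tokens, channels = 64, 32
num_queries, dim = 18, 512                     # assumed W+ style layout, not from the source

features = rng.standard_normal((num_tokens, channels))
queries = rng.standard_normal((num_queries, dim)) * 0.02  # would be trained in practice
w_k = rng.standard_normal((channels, dim)) * 0.02
w_v = rng.standard_normal((channels, dim)) * 0.02

latents, attn = learnable_query_attention(features, queries, w_k, w_v)
print(latents.shape, attn.shape)
```

In a trained model the `queries` array would be updated by gradient descent along with the projections, so each query can specialize in gathering the information needed for one output vector.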
Findings
Addresses the failure of Transformer-backbone inversion networks to invert images.
Achieves state-of-the-art performance in image inversion.
Enhances local detail and global structure understanding.
Abstract
This paper proposes SwinStyleformer, the first inversion network with a pure Transformer structure, which compensates for the shortcomings of CNN-based inversion frameworks by handling long-range dependencies and learning the global structure of objects. Experiments showed, however, that an inversion network with a Transformer backbone could not successfully invert images. This failure arises from differences between CNNs and Transformers: compared with convolution, self-attention weights favor image structure and neglect image details; the Transformer lacks multi-scale properties; and the distribution of the latent code extracted by the Transformer differs from that of the StyleGAN style vector. To address these differences, we employ the Swin Transformer with a smaller window size as the backbone of SwinStyleformer to enhance the local detail of the inversion…
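The effect of window size on locality can be sketched with a simplified version of window-based self-attention. The code below is a minimal NumPy illustration, assuming a single head, identity query/key/value projections, and no shifted windows or relative position bias; the function names and the toy feature-map size are hypothetical and are not taken from the paper.

```python
import numpy as np

def window_partition(x, window_size):
    """Split a feature map (H, W, C) into non-overlapping square windows.

    Returns an array of shape (num_windows, window_size * window_size, C).
    """
    H, W, C = x.shape
    if H % window_size or W % window_size:
        raise ValueError("H and W must be divisible by window_size")
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two window-grid axes together
    return x.reshape(-1, window_size * window_size, C)

def window_self_attention(x, window_size):
    """Self-attention restricted to each window (identity projections, simplified)."""
    windows = window_partition(x, window_size)               # (N, T, C)
    scale = windows.shape[-1] ** -0.5
    scores = windows @ windows.transpose(0, 2, 1) * scale    # (N, T, T)
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ windows, weights

feature_map = np.random.default_rng(0).standard_normal((16, 16, 8))
for ws in (8, 4):
    out, weights = window_self_attention(feature_map, ws)
    # weights.shape[-1] is the number of positions each token can attend to
    print(f"window {ws}x{ws}: {weights.shape[0]} windows, "
          f"{weights.shape[-1]} positions per token")
```

With an 8x8 window each token attends over 64 positions; with a 4x4 window it attends over only 16, so the attention is confined to a tighter neighborhood, which is the intuition behind using a smaller window to favor local detail.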
Taxonomy
Topics: Ultrasound Imaging and Elastography · Infrared Thermography in Medicine
Methods: Linear Layer · Stochastic Depth · Multi-Head Attention · Residual Connection · Convolution · Softmax · Layer Normalization · Focus · Byte Pair Encoding · Label Smoothing
