Compress image to patches for Vision Transformer
Xinfeng Zhao, Yaoru Sun

TL;DR
This paper introduces CI2P-ViT, a hybrid CNN and Vision Transformer model that compresses images into patches, reducing computational cost while improving accuracy.
Contribution
The paper presents a novel image compression-based patch generation method for ViT, reducing computational load and enhancing accuracy by integrating CNN inductive biases.
Findings
Achieved 92.37% accuracy on Animals-10, a 3.3% improvement over the baseline.
Reduced FLOPs by 63.35%, significantly lowering computational costs.
Doubled training speed on identical hardware.
Abstract
The Vision Transformer (ViT) has made significant strides in the field of computer vision. However, as model depth and input image resolution increase, the computational cost of training and running ViT models rises sharply. This paper proposes a hybrid model based on CNN and Vision Transformer, named CI2P-ViT. The model incorporates a module called CI2P, which uses the CompressAI encoder to compress images and then generates a sequence of patches through a series of convolutions. CI2P can replace the Patch Embedding component in the ViT model, enabling seamless integration into existing ViT models. Compared to ViT-B/16, CI2P-ViT feeds only a quarter as many patches into the self-attention layer. This design not only significantly reduces the computational cost of the ViT model but also effectively…
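To make the "quarter of the patches" claim concrete, the following is a minimal arithmetic sketch, not the authors' code. It assumes a 224×224 input (a common ViT-B/16 setting that is not stated in the text above), ignores the class token, and uses a hypothetical helper named `num_patches`. It only counts tokens and token pairs; it does not try to reproduce the whole-model FLOPs reduction reported in the findings, which also depends on the non-attention layers and on the cost of the CI2P module itself.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Count non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


# Assumption: 224x224 input, not stated in the summary above.
IMAGE_SIZE = 224

# ViT-B/16 splits the image into 16x16 patches.
baseline_tokens = num_patches(IMAGE_SIZE, 16)

# The abstract states the patch count is reduced to a quarter.
reduced_tokens = baseline_tokens // 4

# Self-attention compares every token with every other token,
# so the attention-matrix size grows with the square of the token count.
baseline_pairs = baseline_tokens ** 2
reduced_pairs = reduced_tokens ** 2
pair_ratio = baseline_pairs / reduced_pairs

print(f"baseline tokens: {baseline_tokens}")
print(f"reduced tokens:  {reduced_tokens}")
print(f"attention pairs shrink by a factor of {pair_ratio:.0f}")
```

Under these assumptions, a fourfold drop in tokens shrinks the attention matrix sixteenfold, while layers that process each token independently shrink only fourfold.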
Taxonomy
Topics: Image Processing Techniques and Applications · CCD and CMOS Imaging Sensors
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax · Dropout · Vision Transformer · Label Smoothing
