Compress image to patches for Vision Transformer

Xinfeng Zhao; Yaoru Sun

arXiv:2502.10120·cs.CV·February 18, 2025

TL;DR

This paper introduces CI2P-ViT, a hybrid CNN and Vision Transformer model that compresses images into patches to reduce computational costs and improve accuracy, demonstrating significant efficiency and performance gains.

Contribution

The paper presents a novel image compression-based patch generation method for ViT, reducing computational load and enhancing accuracy by integrating CNN inductive biases.

Findings

01. Achieved 92.37% accuracy on Animals-10, a 3.3% improvement over baseline.
02. Reduced FLOPs by 63.35%, significantly lowering computational costs.
03. Doubled training speed on identical hardware.

Abstract

The Vision Transformer (ViT) has made significant strides in computer vision. However, as model depth and input resolution increase, the computational cost of training and running ViT models rises sharply. This paper proposes a hybrid CNN and Vision Transformer model named CI2P-ViT. The model incorporates a module called CI2P, which uses the CompressAI encoder to compress images and then generates a sequence of patches through a series of convolutions. CI2P can replace the Patch Embedding component of a ViT, so it integrates into existing ViT models without other changes. Compared with ViT-B/16, CI2P-ViT feeds only a quarter as many patches to the self-attention layer. This design not only significantly reduces the computational cost of the ViT model but also effectively…
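To make the "quarter of the patches" claim concrete, the short sketch below works through the token arithmetic. It assumes a 224×224 input (the usual ViT-B/16 resolution, which this page does not state) and ignores the class token; the helper names `num_patches` and `attention_pairs` are illustrative and do not come from the paper or its repository.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_pairs(n_tokens: int) -> int:
    """Entries in the n x n attention score matrix (per head, per layer)."""
    return n_tokens * n_tokens


IMAGE_SIZE = 224  # assumption: standard ViT-B/16 input size, not stated here

baseline = num_patches(IMAGE_SIZE, 16)  # patches produced by ViT-B/16
reduced = baseline // 4                 # a quarter as many, per the abstract

# Attention scores scale quadratically with the number of tokens.
ratio = attention_pairs(reduced) / attention_pairs(baseline)

print(f"ViT-B/16 patches: {baseline}")
print(f"Quarter of that:  {reduced}")
print(f"Attention score matrix size ratio: {ratio}")
```

Under these assumptions the attention score matrix shrinks to one sixteenth of its original size. The projection and feed-forward layers scale only linearly with the token count, and the CI2P module adds its own cost, so the overall FLOP reduction is smaller than this ratio alone would suggest; the figure reported above is 63.35%.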

Code & Models

Repositories

fanchy/ci2pvit (PyTorch, official)

Taxonomy

Topics: Image Processing Techniques and Applications · CCD and CMOS Imaging Sensors

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax · Dropout · Vision Transformer · Label Smoothing