COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models

Jinqi Xiao; Miao Yin; Yu Gong; Xiao Zang; Jian Ren; Bo Yuan

arXiv:2305.17235 · cs.CV · December 4, 2024 · 1 citation

TL;DR

This paper introduces an efficient compression method for attention-based vision models such as ViT. It outperforms state-of-the-art pruning methods and enables faster training and lower storage costs for text-to-image models.

Contribution

The paper proposes a new ViT compression approach, based on a new insight into the multi-head attention layer, that improves accuracy and efficiency over existing pruning methods.

Findings

01. Achieves higher top-1 accuracy on ImageNet than state-of-the-art pruning methods, with fewer parameters.

02. Enables up to 2.6x faster training of diffusion models.

03. Reduces storage costs by up to 1927.5x.

Abstract

Attention-based vision models, such as Vision Transformer (ViT) and its variants, have shown promising performance in various computer vision tasks. However, these emerging architectures suffer from large model sizes and high computational costs, calling for efficient model compression solutions. To date, pruning ViTs has been well studied, while other compression strategies that have been widely applied in CNN compression, e.g., model factorization, are little explored in the context of ViT compression. This paper explores an efficient method for compressing vision transformers to enrich the toolset for obtaining compact attention-based vision models. Based on a new insight into the multi-head attention layer, we develop a highly efficient ViT compression solution, which outperforms the state-of-the-art pruning methods. For compressing DeiT-small and DeiT-base models on ImageNet, our…
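
The abstract contrasts pruning with model factorization. As a rough illustration of why factorization shrinks a layer, the sketch below counts the weights in a dense projection and in a two-factor low-rank replacement. It is generic low-rank arithmetic, not the paper's method: the function names and the rank of 64 are hypothetical choices for illustration, and 768 is used because it is the DeiT-base embedding width.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Weights in a dense d_in x d_out projection (bias ignored)."""
    return d_in * d_out


def factorized_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights after replacing W (d_in x d_out) with U (d_in x rank) @ V (rank x d_out)."""
    return rank * (d_in + d_out)


def compression_ratio(d_in: int, d_out: int, rank: int) -> float:
    """How many times smaller the factorized layer is than the dense one."""
    return dense_params(d_in, d_out) / factorized_params(d_in, d_out, rank)


def max_useful_rank(d_in: int, d_out: int) -> int:
    """Largest rank at which the two factors hold strictly fewer weights than W.

    Solves rank * (d_in + d_out) < d_in * d_out for integer rank.
    """
    return (d_in * d_out - 1) // (d_in + d_out)


if __name__ == "__main__":
    d = 768    # DeiT-base embedding width
    rank = 64  # hypothetical rank, chosen only for illustration
    print(dense_params(d, d))             # 589824
    print(factorized_params(d, d, rank))  # 98304
    print(compression_ratio(d, d, rank))  # 6.0
    print(max_useful_rank(d, d))          # 383
```

The count only shows the storage side of the trade-off. How much accuracy survives at a given rank depends on the layer and on fine-tuning, which this sketch does not model.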

Code & Models

Repositories

jinqixiao/ComCAT (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Softmax · Layer Normalization · Byte Pair Encoding · Dropout · Linear Layer · Label Smoothing · Diffusion