CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs

Ao Wang; Hui Chen; Zijia Lin; Sicheng Zhao; Jungong Han; Guiguang Ding

arXiv:2309.15755 · cs.CV · October 20, 2025

TL;DR

CAIT is a novel compression method for Vision Transformers that effectively reduces redundancy, maintains high accuracy, accelerates inference, and ensures transferability to downstream vision tasks by combining token merging and dynamic channel pruning.

Contribution

The paper introduces CAIT, a joint compression approach that combines asymmetric token merging and dynamic channel pruning to improve ViT efficiency without sacrificing performance.
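
To make the two ingredients named above concrete, here is a minimal, generic sketch of token merging and channel pruning on a toy token matrix. It is not CAIT's asymmetric merging or its dynamic pruning schedule, neither of which is specified on this page. The function names (`merge_most_similar_pair`, `prune_channels`), the cosine-similarity criterion, and the mean-magnitude channel score are all assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_most_similar_pair(tokens):
    """Hypothetical helper: average the two most similar tokens into one.

    Merging keeps the information of both tokens in a single vector,
    whereas token pruning would simply discard one of them.
    """
    if len(tokens) < 2:
        return [list(t) for t in tokens]
    best = None
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            s = cosine(tokens[i], tokens[j])
            if best is None or s > best[0]:
                best = (s, i, j)
    _, i, j = best
    merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
    out = [list(t) for k, t in enumerate(tokens) if k not in (i, j)]
    out.insert(i, merged)  # place the merged token where the first one was
    return out


def prune_channels(tokens, keep):
    """Hypothetical helper: keep the `keep` channels with the largest mean |activation|."""
    dim = len(tokens[0])
    score = [sum(abs(t[c]) for t in tokens) / len(tokens) for c in range(dim)]
    ranked = sorted(range(dim), key=lambda c: score[c], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original channel order
    return [[t[c] for c in kept] for t in tokens]


# Toy input: 4 tokens with 3 channels each (made-up values).
tokens = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.1], [0.0, 1.0, 0.0], [0.0, 0.0, 0.2]]
merged = merge_most_similar_pair(tokens)
pruned = prune_channels(merged, keep=2)
print(len(tokens), "->", len(merged), "tokens;",
      len(tokens[0]), "->", len(pruned[0]), "channels")
```

Reducing the token count shrinks the sequence dimension and removing channels shrinks the feature dimension, which is why a joint method can compress along both axes at once. How CAIT selects tokens and channels in practice should be taken from the paper itself.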

Findings

1. Achieves state-of-the-art compression performance on multiple benchmarks.

2. Maintains high accuracy while significantly reducing inference time.

3. Ensures effective transferability to downstream tasks like semantic segmentation.

Abstract

Vision Transformers (ViTs) have recently emerged as state-of-the-art models for various vision tasks. However, their heavy computation costs remain daunting for resource-limited devices. To address this, researchers have dedicated themselves to compressing redundant information in ViTs for acceleration. Existing approaches, however, generally either sparsely drop redundant image tokens through token pruning or brutally remove channels through channel pruning, leading to a sub-optimal balance between model performance and inference speed. Moreover, they struggle when compressed models are transferred to downstream vision tasks that require the spatial structure of images, such as semantic segmentation. To tackle these issues, we propose CAIT, a joint compression method for ViTs that achieves a harmonious blend of high accuracy, fast inference speed, and favorable…

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Cell Image Analysis Techniques

Methods: Pruning