CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers
Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, Jiaqi Wang

TL;DR
CrossGET is a novel framework that adaptively combines tokens during inference to accelerate vision-language Transformers, reducing computational costs while maintaining high performance across multiple tasks.
Contribution
This paper introduces CrossGET, a general acceleration framework built on cross-guided token matching and ensemble. It applies to a range of vision-language models and reduces computational cost while maintaining high performance.
Findings
Significant reduction in computational costs across tasks
Effective token matching mechanism ensuring reliability
Versatile applicability to multiple vision-language architectures
Abstract
Recent vision-language models have achieved tremendous advances. However, their computational costs are also escalating dramatically, making model acceleration exceedingly critical. To pursue more efficient vision-language Transformers, this paper introduces Cross-Guided Ensemble of Tokens (CrossGET), a general acceleration framework for vision-language Transformers. This framework adaptively combines tokens in real-time during inference, significantly reducing computational costs while maintaining high performance. CrossGET features two primary innovations: 1) Cross-Guided Matching and Ensemble. CrossGET leverages cross-modal guided token matching and ensemble to effectively utilize cross-modal information, achieving wider applicability across both modality-independent models, e.g., CLIP, and modality-dependent ones, e.g., BLIP2. 2) Complete-Graph Soft Matching. CrossGET introduces an…
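To make the idea of combining tokens during inference concrete, here is a minimal sketch of generic similarity-based token merging. It is not the authors' algorithm: the cross-modal guidance and the complete-graph soft matching named in the abstract are not modeled, and the function names `cosine` and `merge_tokens`, the greedy pair selection, and the size-weighted averaging are all assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_tokens(tokens, r):
    """Greedily merge the most similar token pair, r times (hypothetical helper).

    Each merge replaces two tokens with their size-weighted mean, so a token
    that already stands for several originals keeps proportionally more weight.
    Returns the reduced token list and how many originals each token represents.
    """
    toks = [list(t) for t in tokens]
    sizes = [1] * len(toks)
    for _ in range(min(r, len(toks) - 1)):
        best = None
        # Exhaustive search over all pairs; fine for a sketch, quadratic in token count.
        for i in range(len(toks)):
            for j in range(i + 1, len(toks)):
                s = cosine(toks[i], toks[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        total = sizes[i] + sizes[j]
        toks[i] = [(sizes[i] * x + sizes[j] * y) / total
                   for x, y in zip(toks[i], toks[j])]
        sizes[i] = total
        del toks[j]
        del sizes[j]
    return toks, sizes


# Three toy 2-D tokens; the first two point in nearly the same direction.
merged, sizes = merge_tokens([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], r=1)
print(len(merged), sizes)
```

Fewer tokens entering later layers is what lowers the cost: self-attention scales quadratically with sequence length, so removing even a few tokens per layer compounds across the network.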
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Attention Is All You Need · Softmax · Layer Normalization · Byte Pair Encoding · Dropout · Linear Layer · Label Smoothing · Residual Connection · Position-Wise Feed-Forward Layer · Absolute Position Encodings
