CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers

Dachuan Shi; Chaofan Tao; Anyi Rao; Zhendong Yang; Chun Yuan; Jiaqi Wang

arXiv:2305.17455·cs.CV·June 17, 2024·5 cites

Open Access · 1 Repo

TL;DR

CrossGET is a novel framework that adaptively combines tokens during inference to accelerate vision-language Transformers, reducing computational costs while maintaining high performance across multiple tasks.

Contribution

This paper introduces CrossGET, a general acceleration framework with cross-guided token matching and ensemble, applicable to various vision-language models, enhancing efficiency without sacrificing accuracy.

Findings

01. Significant reduction in computational costs across tasks

02. Effective token matching mechanism ensuring reliability

03. Versatile applicability to multiple vision-language architectures

Abstract

Recent vision-language models have achieved tremendous advances. However, their computational costs are also escalating dramatically, making model acceleration exceedingly critical. To pursue more efficient vision-language Transformers, this paper introduces Cross-Guided Ensemble of Tokens (CrossGET), a general acceleration framework for vision-language Transformers. This framework adaptively combines tokens in real-time during inference, significantly reducing computational costs while maintaining high performance. CrossGET features two primary innovations: 1) Cross-Guided Matching and Ensemble. CrossGET leverages cross-modal guided token matching and ensemble to effectively utilize cross-modal information, achieving wider applicability across both modality-independent models, e.g., CLIP, and modality-dependent ones, e.g., BLIP2. 2) Complete-Graph Soft Matching. CrossGET introduces an…
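
To make the idea of combining tokens during inference concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the CrossGET algorithm: the helper names `cosine` and `merge_most_similar` are hypothetical, the pairing is a simple greedy pass over all token pairs, merged tokens are plainly averaged, and the cross-modal guidance the abstract describes is left out entirely.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_most_similar(tokens, r):
    """Merge up to r disjoint pairs of most-similar tokens by averaging.

    Hypothetical helper for illustration only; not the paper's method.
    """
    n = len(tokens)
    # Score every pair (i, j) with i < j, i.e. all edges of a complete graph.
    pairs = sorted(
        ((cosine(tokens[i], tokens[j]), i, j)
         for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    used, merged_into = set(), {}
    for _, i, j in pairs:
        if len(merged_into) == r:
            break
        if i in used or j in used:
            continue  # each token takes part in at most one merge
        used.update((i, j))
        merged_into[j] = i  # token j is folded into token i
    out = []
    for idx, tok in enumerate(tokens):
        if idx in merged_into:
            continue  # dropped: its content is averaged into its partner
        partners = [j for j, i in merged_into.items() if i == idx]
        if partners:
            other = tokens[partners[0]]
            tok = [(x + y) / 2 for x, y in zip(tok, other)]
        out.append(tok)
    return out


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
reduced = merge_most_similar(tokens, 2)
print(len(tokens), "->", len(reduced))
```

Each merge shortens the sequence by one token, so the layers that follow process fewer tokens; that is the general source of savings in token-reduction approaches of this kind.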

Code & Models

Repositories

sdc17/crossget (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Attention Is All You Need · Softmax · Layer Normalization · Byte Pair Encoding · Dropout · Linear Layer · Label Smoothing · Residual Connection · Position-Wise Feed-Forward Layer · Absolute Position Encodings