BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization

Chaoya Jiang; Haiyang Xu; Wei Ye; Qinghao Ye; Chenliang Li; Ming Yan; Bin Bi; Shikun Zhang; Fei Huang; Songfang Huang

arXiv:2307.08504·cs.CV·February 27, 2024·1 cites

Open Access

TL;DR

BUS introduces a bottom-up patch summarization method for vision-language pre-training, combining bottom-level patch selection and top-level abstraction to improve training efficiency and maintain high task performance.

Contribution

The paper proposes a novel Bottom-Up Patch Summarization approach that balances training efficiency and effectiveness in vision-language models by integrating patch selection and abstraction.

Findings

01. Boosts training efficiency by 50%

02. Achieves state-of-the-art results on multiple downstream tasks

03. Maintains or improves performance with higher input resolutions

Abstract

Vision Transformer (ViT) based Vision-Language Pre-training (VLP) models have demonstrated impressive performance in various tasks. However, the lengthy visual token sequences fed into ViT can lead to training inefficiency and ineffectiveness. Existing efforts address the challenge either by bottom-level patch extraction in the ViT backbone or by top-level patch abstraction outside it, and neither balances training efficiency and effectiveness well. Inspired by text summarization in natural language processing, we propose a Bottom-Up Patch Summarization approach named BUS, which coordinates bottom-level extraction and top-level abstraction to learn a concise summary of lengthy visual token sequences efficiently. Specifically, we incorporate a Text-Semantics-Aware Patch Selector (TSPS) into the ViT backbone to perform a coarse-grained visual token extraction and then attach a flexible Transformer-based…
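
The coarse-grained extraction step described in the abstract can be illustrated with a small sketch. The abstract does not say how TSPS scores or selects patches, so everything here is an assumption made for illustration: the function name `select_patches`, the dot-product scoring against a single text vector, and the `keep_ratio` parameter are all hypothetical and are not taken from the paper. The sketch only shows the general idea of keeping the visual tokens most relevant to the text so that later layers process a shorter sequence.

```python
from typing import List, Sequence


def select_patches(
    patch_tokens: Sequence[Sequence[float]],
    text_vector: Sequence[float],
    keep_ratio: float = 0.5,
) -> List[int]:
    """Return indices of the patch tokens kept by a text-aware top-k selection.

    Hypothetical helper: the scoring rule (dot product with one text vector)
    is an assumption, not the selector used in the paper.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Score each patch by its dot product with the text vector.
    scores = [
        sum(p * t for p, t in zip(patch, text_vector)) for patch in patch_tokens
    ]
    # Keep at least one patch, even for very small ratios.
    num_keep = max(1, int(len(patch_tokens) * keep_ratio))
    # Rank by score, highest first; ties go to the lower index.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Return the kept indices in their original order.
    return sorted(ranked[:num_keep])


# Toy input: four 2-dimensional patch tokens and one text vector.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text = [1.0, 0.2]
kept = select_patches(patches, text, keep_ratio=0.5)
print(kept)  # [0, 2]
```

With `keep_ratio=0.5`, half of the patch tokens are dropped before the remaining layers run, which is the kind of sequence shortening the abstract attributes to bottom-level extraction. Returning the indices in their original order is a design choice of this sketch so that any positional information attached to the tokens stays aligned.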

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Adam · Layer Normalization