Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention

Zineng Tang; Jaemin Cho; Jie Lei; Mohit Bansal

arXiv:2211.11701 · cs.CV · November 22, 2022 · 1 citation

TL;DR

Perceiver-VL is a vision-and-language framework built on the iterative latent cross-attention of Perceiver. Its cost scales linearly with input size, and it achieves the lowest GFLOPs and latency among the compared models while remaining competitive on video-text and image-text benchmarks.

Contribution

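The paper presents a scalable multimodal model whose cost grows linearly with input size, thanks to iterative latent cross-attention. It also studies two further efficiency measures: applying LayerDrop to the cross-attention layers, and a mixed-stream architecture for cross-modal retrieval.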
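To make the linear-versus-quadratic claim concrete, the following minimal sketch counts attention scores only. It is not the authors' code: the function names and the latent size of 128 are hypothetical choices for illustration, and real GFLOPs also depend on hidden size, number of layers, and projections.

```python
def self_attention_score_count(num_inputs: int) -> int:
    """Scores computed when every input token attends to every input token."""
    return num_inputs * num_inputs


def latent_cross_attention_score_count(num_inputs: int, num_latents: int) -> int:
    """Scores computed when a fixed-size latent array attends to the inputs."""
    return num_latents * num_inputs


if __name__ == "__main__":
    num_latents = 128  # hypothetical latent array size, not taken from the paper
    for num_inputs in (1_000, 2_000, 4_000):
        quadratic = self_attention_score_count(num_inputs)
        linear = latent_cross_attention_score_count(num_inputs, num_latents)
        print(f"inputs={num_inputs}: self-attention={quadratic}, latent cross-attention={linear}")
```

Doubling the input length quadruples the self-attention count but only doubles the latent cross-attention count, because the latent array size stays fixed.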

Findings

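01 Achieves the lowest GFLOPs and latency among the compared models on the evaluated video-text and image-text benchmarks.

02 Maintains competitive task performance despite the reduced compute.

03 Provides analyses of framework components, including pretraining data and the scalability of latent size and input size.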

Abstract

We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text. Powered by the iterative latent cross-attention of Perceiver, our framework scales with linear complexity, in contrast to the quadratic complexity of self-attention used in many state-of-the-art transformer-based models. To further improve the efficiency of our framework, we also study applying LayerDrop on cross-attention layers and introduce a mixed-stream architecture for cross-modal retrieval. We evaluate Perceiver-VL on diverse video-text and image-text benchmarks, where Perceiver-VL achieves the lowest GFLOPs and latency while maintaining competitive performance. In addition, we also provide comprehensive analyses of various aspects of our framework, including pretraining data, scalability of latent size and input size, dropping…
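The mechanism described in the abstract can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the official implementation: it omits learned query/key/value projections and the latent self-attention blocks of Perceiver, uses the inputs as both keys and values, and always keeps the first cross-attention layer when applying LayerDrop (an assumption made here so the latents see the input at least once). All function names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents, inputs):
    """Each latent (query) attends over all inputs (used as keys and values).

    Cost is len(latents) * len(inputs) dot products, so it grows linearly
    with the number of inputs when the latent array size is fixed.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for query in latents:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in inputs]
        weights = softmax(scores)
        context = [sum(w * value[d] for w, value in zip(weights, inputs)) for d in range(dim)]
        # Residual connection: add the attended context to the latent.
        updated.append([q + c for q, c in zip(query, context)])
    return updated


def iterative_latent_attention(latents, inputs, num_layers, drop_prob, rng, training=True):
    """Apply cross-attention repeatedly, skipping layers at random (LayerDrop).

    Assumption: layer 0 is never dropped. Returns the final latents and the
    number of cross-attention layers that were actually applied.
    """
    applied = 0
    for layer in range(num_layers):
        if training and layer > 0 and rng.random() < drop_prob:
            continue  # LayerDrop: skip this cross-attention layer
        latents = cross_attend(latents, inputs)
        applied += 1
    return latents, applied


if __name__ == "__main__":
    rng = random.Random(0)
    latents = [[0.1, 0.2], [0.3, -0.1]]  # 2 latents of dimension 2
    inputs = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(6)]  # 6 input tokens
    out, applied = iterative_latent_attention(latents, inputs, num_layers=4, drop_prob=0.5, rng=rng)
    print(f"applied {applied} of 4 cross-attention layers; output shape {len(out)}x{len(out[0])}")
```

With `training=False` every layer runs. A model trained with dropped layers can, in principle, also be run with fewer cross-attention layers at inference to trade accuracy for latency.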

Code & Models

Repositories

zinengtang/perceiver_vl · PyTorch · Official

Videos

Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: LayerDrop