FLAVA: A Foundational Language And Vision Alignment Model

Amanpreet Singh; Ronghang Hu; Vedanuj Goswami; Guillaume Couairon; Wojciech Galuba; Marcus Rohrbach; Douwe Kiela

arXiv:2112.04482 · cs.CV · March 31, 2022 · 36 cites

TL;DR

FLAVA is a single vision-and-language foundation model that targets vision tasks, language tasks, and cross- and multi-modal tasks at once, with strong reported performance across 35 benchmarks.

Contribution

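The paper introduces FLAVA, a universal model that handles vision, language, and their combination in one system. It addresses a limitation of prior models, which are typically either cross-modal (contrastive) or multi-modal (with earlier fusion) but not both, and which often target only specific modalities or tasks.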

Findings

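01 Reports strong performance on a wide range of 35 tasks spanning vision, language, and vision-and-language benchmarks.

02 Covers both cross-modal (contrastive) and multi-modal (fused) understanding in a single model.

03 Applies across vision-only, language-only, and combined vision-and-language downstream tasks.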

Abstract

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal (with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a "foundation", that targets all modalities at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.
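To make the "cross-modal (contrastive)" setting named in the abstract concrete, here is a minimal sketch of a symmetric contrastive loss over paired image and text embeddings. It assumes an InfoNCE-style objective with L2-normalized embeddings; the function names, the temperature value, and the toy vectors are hypothetical and are not taken from the paper or from the FLAVA code. A multi-modal (fused) model would instead pass both inputs through a shared encoder, which this sketch does not cover.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(image_embs, text_embs, temperature):
    """Cosine similarity of every image with every text, divided by temperature."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Average of the image-to-text and text-to-image losses (assumed symmetric form)."""
    logits = similarity_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


# Toy batch (hypothetical values): pair i of images matches pair i of texts.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]

aligned = contrastive_loss(images, aligned_texts)
shuffled = contrastive_loss(images, shuffled_texts)
print(f"aligned pairs: {aligned:.4f}, shuffled pairs: {shuffled:.4f}")
```

The loss is low when each image is most similar to its own caption and high when the pairing is shuffled, which is the behavior a contrastive objective rewards.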

Code & Models

Models

facebook/flava-full (model, 20k downloads, 43 likes)

Datasets

facebook/pmd (dataset, 37 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques

Methods: FLAVA