FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela

TL;DR
FLAVA is a comprehensive vision and language foundation model that unifies multiple modalities and tasks, achieving strong performance across 35 diverse vision, language, and cross-modal benchmarks.
Contribution
The paper introduces FLAVA, a novel universal model that simultaneously handles vision, language, and their interactions, addressing limitations of prior modality-specific models.
Findings
Achieves strong performance on 35 tasks spanning vision, language, and cross- and multi-modal vision-and-language benchmarks.
Demonstrates effective multi-modal and cross-modal understanding.
Shows versatility across diverse downstream applications.
Abstract
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal (with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a "foundation", that targets all modalities at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.
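The abstract contrasts cross-modal (contrastive) models with multi-modal (fusion) models. As a rough illustration of the contrastive side only, the sketch below computes a symmetric image-text contrastive loss over a batch of paired embeddings. It is not FLAVA's implementation: the function names, the plain-list embeddings, and the temperature of 0.07 are all assumptions made for this example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row k of each input is assumed to be a matching pair.

    The temperature value is an illustrative assumption, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity of every image with every text, scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image-to-text: each image should pick out its own caption.
    i2t = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    # Text-to-image: each caption should pick out its own image.
    t2i = sum(cross_entropy_row([sim[r][k] for r in range(n)], k)
              for k in range(n)) / n
    return 0.5 * (i2t + t2i)


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs, the correctly paired batch yields a much lower loss than the batch whose captions are swapped, which is the signal a contrastive objective trains on. A fusion-style model would instead pass image and text tokens through a shared encoder, which this sketch does not cover.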
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
Methods: FLAVA
