Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac; Jeff Donahue; Pauline Luc; Antoine Miech; Iain Barr; Yana Hasson; Karel Lenc; Arthur Mensch; Katie Millican; Malcolm Reynolds; Roman Ring; Eliza Rutherford; Serkan Cabi; Tengda Han; Zhitao Gong; Sina Samangooei; Marianne Monteiro; Jacob Menick; Sebastian Borgeaud; Andrew Brock; Aida Nematzadeh; Sahand Sharifzadeh; Mikolaj Binkowski; Ricardo Barreira; Oriol Vinyals; Andrew Zisserman; Karen Simonyan

arXiv:2204.14198 · cs.CV · November 17, 2022 · 1.2k cites

Open Access 5 Repos 10 Models 1 Video

TL;DR

Flamingo is a family of visual language models that learn new image and video tasks from only a few examples. It does this by bridging pretrained vision-only and language-only models and by training on large-scale web corpora of interleaved text and images.

Contribution

The paper introduces Flamingo, a multimodal architecture that lets a single model adapt to new image and video tasks from a handful of annotated examples, and that is reported to outperform models fine-tuned on much larger task-specific datasets.

Findings

01 Achieves state-of-the-art few-shot performance on multiple benchmarks.

02 Outperforms models trained on much larger task-specific datasets.

03 Handles both images and videos seamlessly in a unified framework.

Abstract

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as…
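The in-context few-shot setup mentioned in the abstract can be illustrated with a minimal sketch of how an interleaved prompt might be assembled. Everything here is an assumption for illustration: the `Example` class, the `build_few_shot_prompt` function, the `<image>` placeholder, the `<EOC>` marker, and the `Output:` prefix are hypothetical names, not an API from the paper or from any released code.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

# Assumed markers for this sketch only; a real model defines its own special tokens.
IMAGE_TAG = "<image>"    # marks where a visual input sits in the text stream
END_OF_CHUNK = "<EOC>"   # closes the text that belongs to one visual input


@dataclass
class Example:
    """One annotated support example: a visual input and its target text."""
    image: Any  # opaque handle (path, array, video clip); not inspected here
    text: str


def build_few_shot_prompt(
    support: List[Example], query_image: Any, prefix: str = "Output:"
) -> Tuple[str, List[Any]]:
    """Interleave support examples with a final unanswered query.

    Returns the prompt text and the visual inputs in the order their
    placeholders appear, so the i-th IMAGE_TAG pairs with images[i].
    """
    parts: List[str] = []
    images: List[Any] = []
    for ex in support:
        parts.append(f"{IMAGE_TAG}{prefix} {ex.text}{END_OF_CHUNK}")
        images.append(ex.image)
    # The query ends with the bare prefix; the model would continue from here.
    parts.append(f"{IMAGE_TAG}{prefix}")
    images.append(query_image)
    return "".join(parts), images


support = [Example("cat.jpg", "A cat."), Example("dog.jpg", "A dog.")]
prompt, images = build_few_shot_prompt(support, "bird.jpg")
print(prompt)
```

In this sketch the task is specified entirely by the support examples in the prompt, so switching tasks means changing the examples and prefix rather than updating any model weights.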

Code & Models

Videos

Flamingo: a Visual Language Model for Few-Shot Learning · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques