Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick

TL;DR
Flamingo is a visual language model that performs few-shot learning across diverse image and video tasks. Its architecture bridges pretrained vision and language models, and it is trained on large-scale multimodal web data with interleaved text and images.
Contribution
The paper introduces Flamingo, a family of visual language models whose architecture enables rapid adaptation to new tasks from only a handful of in-context examples, outperforming models fine-tuned on much larger task-specific datasets.
Findings
Achieves state-of-the-art few-shot performance on multiple benchmarks.
Outperforms models trained on much larger task-specific datasets.
Handles both images and videos seamlessly in a unified framework.
Abstract
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as…
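To make "sequences of arbitrarily interleaved visual and textual data" concrete, here is a minimal sketch of how a few-shot prompt might be assembled for such a model. The `<image>` and `<EOC>` markers, the `Shot` class, the `build_prompt` helper, and the file names are illustrative assumptions for this page, not an API from the paper or from any of the listed checkpoints.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed placeholder tokens: one marks where a visual input sits in the text,
# the other closes each image-text pair. Real checkpoints may use other markers.
IMAGE_TAG = "<image>"
END_OF_CHUNK = "<EOC>"


@dataclass
class Shot:
    """One support example: a visual input and its target text (hypothetical type)."""
    image_path: str
    text: str


def build_prompt(shots: List[Shot], query_image: str,
                 prefix: str = "Output:") -> Tuple[str, List[str]]:
    """Interleave support examples with a final query image.

    Returns the text prompt and the ordered list of images, so that the i-th
    IMAGE_TAG in the prompt corresponds to the i-th image in the list.
    """
    parts: List[str] = []
    images: List[str] = []
    for shot in shots:
        images.append(shot.image_path)
        parts.append(f"{IMAGE_TAG}{prefix} {shot.text}{END_OF_CHUNK}")
    # The query ends with an open prefix for the model to complete.
    images.append(query_image)
    parts.append(f"{IMAGE_TAG}{prefix}")
    return "".join(parts), images


shots = [Shot("cat.jpg", "A cat on a sofa."), Shot("dog.jpg", "A dog in a park.")]
prompt, images = build_prompt(shots, "bird.jpg")
print(prompt)
```

The number of support examples is just the length of `shots`, so a zero-shot prompt is the same call with an empty list; no weights change between tasks, only the prompt.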
Code & Models
- 🤗 HuggingFaceM4/idefics-80b (model) · 331 dl · ♡ 69
- 🤗 HuggingFaceM4/idefics-9b (model) · 1.9k dl · ♡ 47
- 🤗 HuggingFaceM4/idefics-9b-instruct (model) · 1.2k dl · ♡ 107
- 🤗 HuggingFaceM4/idefics-80b-instruct (model) · 5.3k dl · ♡ 189
- 🤗 areegtarek/idefics-9b-instruct-all (model) · 12 dl
- 🤗 nvidia/audio-flamingo-2 (model) · ♡ 48
- 🤗 nvidia/audio-flamingo-2-1.5B (model) · ♡ 6
- 🤗 nvidia/audio-flamingo-2-0.5B (model) · ♡ 13
- 🤗 Sony/AKI-4B-phi-3.5-mini (model) · 10 dl · ♡ 27
- 🤗 Lyon28/caca-1M-untrained (model) · 16 dl
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
