FLAIR: VLM with Fine-grained Language-informed Image Representations
Rui Xiao, Sanghwan Kim, Mariana-Iuliana Georgescu, Zeynep Akata, Stephan Alaniz

TL;DR
FLAIR enhances vision-language models by learning detailed, localized image representations from fine-grained descriptions, significantly improving performance on retrieval and segmentation tasks over existing models.
Contribution
Introduces FLAIR, a novel approach that leverages detailed image descriptions to produce fine-grained, text-conditioned image embeddings for improved multimodal understanding.
Findings
Achieves state-of-the-art results on multimodal retrieval benchmarks.
On fine-grained tasks, outperforms models trained on billions of image-text pairs.
Demonstrates effectiveness in zero-shot semantic segmentation.
Abstract
CLIP has shown impressive results in aligning images and texts at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details about an image, we train our vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens to produce fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance on both existing multimodal retrieval benchmarks as well as our newly…
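The text-conditioned attention pooling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, where a text embedding acts as the query over local image tokens. The names text_conditioned_pool, text_query, and image_tokens are hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def text_conditioned_pool(text_query, image_tokens):
    """Pool local image tokens into one vector, weighted by each token's
    similarity to the text query (assumed: plain scaled dot-product
    attention, no learned projections or multiple heads)."""
    dim = len(text_query)
    scale = 1.0 / math.sqrt(dim)
    # One attention score per local image token.
    scores = [scale * sum(q * t for q, t in zip(text_query, tok))
              for tok in image_tokens]
    weights = softmax(scores)
    # Weighted sum of tokens gives the text-specific image representation.
    pooled = [sum(w * tok[d] for w, tok in zip(weights, image_tokens))
              for d in range(dim)]
    return pooled, weights

# Toy example: three local tokens (e.g. image patches) with 2-D features.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_query = [4.0, 0.0]  # hypothetical text embedding aligned with token 0
pooled, weights = text_conditioned_pool(text_query, image_tokens)
```

In this toy setup, the token most similar to the text query receives the largest weight, so the pooled vector leans toward the image content the text describes. A different caption would yield a different pooled embedding for the same image, which is the sense in which the representation is text-specific.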
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
Methods: Softmax · Attention Is All You Need · Attention Pooling · Contrastive Language-Image Pre-training
