FLAIR: VLM with Fine-grained Language-informed Image Representations

Rui Xiao; Sanghwan Kim; Mariana-Iuliana Georgescu; Zeynep Akata; Stephan Alaniz

arXiv:2412.03561·cs.CV·December 5, 2024

TL;DR

FLAIR enhances vision-language models by learning detailed, localized image representations through fine-grained descriptions, significantly improving retrieval and segmentation tasks over existing models.

Contribution

Introduces FLAIR, a novel approach that leverages detailed image descriptions to produce fine-grained, text-conditioned image embeddings for improved multimodal understanding.

Findings

01. Achieves state-of-the-art results on multimodal retrieval benchmarks.

02. Outperforms models trained on billions of pairs in fine-grained tasks.

03. Demonstrates effectiveness in zero-shot semantic segmentation.

Abstract

CLIP has shown impressive results in aligning images and texts at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details about an image, we train our vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens to produce fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance on both existing multimodal retrieval benchmarks and our newly…
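The text-conditioned attention pooling mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single attention head with no learned projections, where the caption embedding acts as the query and the local image tokens act as both keys and values. The function names (`text_conditioned_pool`, `cosine`) and the toy vectors are hypothetical and are not taken from the paper or its released code; the actual model may differ in every one of these details.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_conditioned_pool(text_query, image_tokens):
    """Pool local image tokens with a text embedding as the attention query.

    Assumption: one head, scaled dot-product scores, no learned projections.
    Returns the pooled image vector and the attention weights.
    """
    scale = 1.0 / math.sqrt(len(text_query))
    weights = softmax([dot(text_query, tok) * scale for tok in image_tokens])
    dim = len(image_tokens[0])
    pooled = [
        sum(w * tok[d] for w, tok in zip(weights, image_tokens))
        for d in range(dim)
    ]
    return pooled, weights


def cosine(a, b):
    """Cosine similarity, as commonly used for image-text matching."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy data: three local image tokens, each "about" a different region.
image_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
caption_a = [4.0, 0.0, 0.0]  # hypothetical sub-caption matching token 0
caption_b = [0.0, 0.0, 4.0]  # hypothetical sub-caption matching token 2

pooled_a, weights_a = text_conditioned_pool(caption_a, image_tokens)
pooled_b, weights_b = text_conditioned_pool(caption_b, image_tokens)
```

In this toy setup the same image yields a different pooled representation for each caption, which is the contrast with a single global image embedding that the abstract draws.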

Code & Models

Models

xiaorui638/flair (Hugging Face model · 149 downloads · 7 likes)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques

Methods: Softmax · Attention Is All You Need · Attention Pooling · Contrastive Language-Image Pre-training