FILIP: Fine-grained Interactive Language-Image Pre-Training

Lewei Yao; Runhui Huang; Lu Hou; Guansong Lu; Minzhe Niu; Hang Xu; Xiaodan Liang; Zhenguo Li; Xin Jiang; Chunjing Xu

arXiv:2111.07783 · cs.CV · November 16, 2021 · 205 cites

TL;DR

FILIP introduces a scalable, efficient fine-grained cross-modal pre-training method that improves alignment between image patches and textual words, achieving state-of-the-art results on vision-language tasks.

Contribution

The paper proposes a novel late interaction mechanism for fine-grained alignment in vision-language pre-training, enabling efficient offline representation computation.
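The efficiency point can be illustrated with a small retrieval sketch. Because a late interaction score is computed from token embeddings that each encoder produces independently, the image side can be encoded once and cached, and only the text query needs encoding at search time. Everything named here (`encode_tokens`, `score`, `search`, `raw_images`, the toy 2-D vectors) is a hypothetical stand-in for illustration, not the paper's code; in particular `encode_tokens` only L2-normalizes its input, where a real system would run a trained encoder.

```python
import math


def unit(v):
    """L2-normalize a vector (assumes it is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def encode_tokens(raw_tokens):
    """Hypothetical stand-in for an encoder: normalizes each token vector."""
    return [unit(t) for t in raw_tokens]


def score(query, doc):
    """Late interaction: for each query token take the max dot product
    with any doc token, then average over the query tokens."""
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in doc) for q in query
    ) / len(query)


# Offline step: every image is encoded once, independently of any query.
raw_images = {
    "cat": [[1.0, 0.0], [0.9, 0.1]],
    "dog": [[0.0, 1.0], [0.1, 0.9]],
}
image_index = {name: encode_tokens(p) for name, p in raw_images.items()}


def search(text_tokens, index):
    """Online step: encode the query, then rank the cached images."""
    q = encode_tokens(text_tokens)
    return sorted(index, key=lambda name: score(q, index[name]), reverse=True)


ranking = search([[1.0, 0.0]], image_index)
```

With cross-attention, by contrast, each image-text pair would have to pass through the model together, so nothing on the image side could be cached ahead of the query.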

Findings

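01. FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks, including zero-shot image classification and image-text retrieval.

02. Visualizations of word-patch alignment show that the method learns meaningful fine-grained features with promising localization ability.

03. Image and text representations can be pre-computed offline, which keeps both large-scale training and inference efficient.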

Abstract

Unsupervised large-scale vision-language pre-training has shown promising advances on various downstream tasks. Existing methods often model the cross-modal interaction either via the similarity of the global features of each modality, which misses fine-grained information, or via finer-grained interactions using cross/self-attention over visual and textual tokens. However, cross/self-attention is inefficient in both training and inference. In this paper, we introduce large-scale Fine-grained Interactive Language-Image Pre-training (FILIP), which achieves finer-level alignment through a cross-modal late interaction mechanism that uses the token-wise maximum similarity between visual and textual tokens to guide the contrastive objective. FILIP successfully leverages the finer-grained expressiveness between image patches and textual words by modifying only the contrastive loss, while…
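The token-wise maximum similarity described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: token features are taken to be plain non-zero vectors that are L2-normalized before comparison, the per-token maxima are averaged, and the two directions feed a symmetric softmax cross-entropy over the batch. The function names and the temperature value of 0.07 are hypothetical choices made for this example.

```python
import math


def normalize(v):
    """L2-normalize a vector (assumes it is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def late_interaction(query_tokens, key_tokens):
    """Mean over query tokens of the max cosine similarity to any key token."""
    q = [normalize(t) for t in query_tokens]
    k = [normalize(t) for t in key_tokens]
    return sum(max(dot(qt, kt) for kt in k) for qt in q) / len(q)


def cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def contrastive_loss(images, texts, temperature=0.07):
    """Symmetric contrastive loss for a batch where images[i] pairs with texts[i].

    The temperature is an assumed constant for this sketch.
    """
    n = len(images)
    # Image-to-text: each image patch looks for its best-matching word.
    i2t = [[late_interaction(im, tx) / temperature for tx in texts] for im in images]
    # Text-to-image: each word looks for its best-matching patch.
    t2i = [[late_interaction(tx, im) / temperature for im in images] for tx in texts]
    loss_i2t = sum(cross_entropy(i2t[i], i) for i in range(n)) / n
    loss_t2i = sum(cross_entropy(t2i[i], i) for i in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch: two images (two patches each) and two texts (two tokens each).
images = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, -1.0]]]
texts = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, -1.0]]]
matched = contrastive_loss(images, texts)
mismatched = contrastive_loss(images, texts[::-1])
```

Note that the two directions are not symmetric in general: the image-to-text score averages over patches while the text-to-image score averages over words, which is why the sketch computes both rather than transposing one matrix.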

Code & Models

Repositories

lucidrains/x-clip · PyTorch

Videos

FILIP: Fine-grained Interactive Language-Image Pre-Training · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling