FG-CLIP: Fine-Grained Visual and Textual Alignment

Chunyu Xie; Bin Wang; Fanjing Kong; Jincheng Li; Dawei Liang; Gengshen Zhang; Dawei Leng; Yuhui Yin

arXiv:2505.05071 · cs.CV · May 22, 2025

TL;DR

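FG-CLIP improves fine-grained visual and textual understanding by training on 1.6 billion long caption-image pairs, 12 million images with 40 million region-level bounding boxes and detailed captions, and 10 million hard fine-grained negative samples. It outperforms the original CLIP and other existing models on multiple multimodal tasks.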

Contribution

The paper introduces FG-CLIP, a novel approach that enhances fine-grained multimodal understanding through new datasets, data augmentation strategies, and training techniques.

Findings

01 FG-CLIP outperforms the original CLIP on fine-grained understanding tasks.

02 It achieves superior results in open-vocabulary object detection.

03 It improves image-text retrieval accuracy.

Abstract

Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks such as image-text retrieval and zero-shot classification but struggles with fine-grained understanding due to its focus on coarse-grained short captions. To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances fine-grained understanding through three key innovations. First, we leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. Second, a high-quality dataset is constructed with 12 million images and 40 million region-specific bounding boxes aligned with detailed captions to ensure precise, context-rich representations. Third, 10 million hard fine-grained negative samples are incorporated to improve the model's ability to distinguish subtle semantic differences. We construct a comprehensive dataset, termed FineHARD, by…
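The abstract describes two training signals: global image-text contrastive alignment and hard fine-grained negatives for regions. The sketch below illustrates the general idea with a CLIP-style symmetric contrastive loss and a simple positive-versus-hard-negatives loss. It is not the authors' implementation: the function names (`clip_loss`, `hard_negative_loss`), the temperature value of 0.07, and the exact loss forms are assumptions made for illustration, since this listing does not state the paper's objective.

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def clip_loss(image_embs, text_embs, temperature=0.07):
    # Symmetric contrastive loss: pair k of images and texts is the positive,
    # every other pairing in the batch is a negative.
    # temperature=0.07 is an assumed value, not taken from this listing.
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    sim = [[dot(i, t) / temperature for t in txts] for i in imgs]
    image_to_text = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)

def hard_negative_loss(region_emb, positive_emb, negative_embs, temperature=0.07):
    # Hypothetical region-level loss: score one region against its true caption
    # (index 0) and several subtly wrong captions (the hard negatives).
    r = normalize(region_emb)
    candidates = [normalize(positive_emb)] + [normalize(v) for v in negative_embs]
    logits = [dot(r, c) / temperature for c in candidates]
    return cross_entropy(logits, 0)
```

In this sketch a negative caption whose embedding lies close to the positive produces a larger loss than an unrelated one, which is the intuition behind using hard negatives to teach subtle semantic distinctions.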

Code & Models

Repositories

360cvgroup/fg-clip (PyTorch, official)

Taxonomy

Methods

Focus · Contrastive Language-Image Pre-training