FIX-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text
Bingchao Wang, Zhiwei Ning, Jianyu Ding, Xuanang Gao, Yin Li, Dongsheng Jiang, Jie Yang, Wei Liu

TL;DR
FIX-CLIP introduces a dual-branch hierarchical contrastive learning framework with synthetic captions, significantly improving long-text understanding in CLIP models while maintaining short-text capabilities, and achieves state-of-the-art results on retrieval benchmarks.
Contribution
The paper proposes a novel dual-branch training pipeline, regional prompts, and hierarchical feature alignment to enhance long-text comprehension in CLIP, utilizing synthetic captions for large-scale training.
Findings
Achieves state-of-the-art performance on long-text retrieval benchmarks.
Enhances long-text representation without sacrificing short-text ability.
Demonstrates promising downstream application performance in diffusion models.
Abstract
CLIP has shown promising performance across many short-text tasks in a zero-shot manner. However, limited by the input length of the text encoder, CLIP struggles on downstream tasks with long-text inputs (>77 tokens). To remedy this issue, we propose FIX-CLIP, which includes three novel modules: (1) A dual-branch training pipeline that aligns short and long texts with masked and raw images, respectively, which boosts the long-text representation while preserving the short-text ability. (2) Multiple learnable regional prompts with unidirectional masks in Transformer layers for regional information extraction. (3) A hierarchical feature alignment module in the intermediate encoder layers to promote the consistency of multi-scale features. Furthermore, we collect 30M images and utilize existing MLLMs to synthesize long-text captions for training. Extensive experiments show that…
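The dual-branch pairing described in module (1) can be illustrated with a minimal sketch. It assumes a standard symmetric InfoNCE (CLIP-style) contrastive loss for each branch and a simple weighted sum of the two; the abstract does not state the exact loss, so the function names `info_nce` and `dual_branch_loss`, the `temperature` value, and the `short_weight` parameter are all hypothetical. Image masking and the encoders themselves are omitted; the inputs are already-computed embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where image i is paired with text i.

    Assumption: a CLIP-style loss; the paper's exact formulation may differ.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    text_to_image = [list(col) for col in zip(*logits)]  # transpose
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(text_to_image))


def dual_branch_loss(raw_img, masked_img, long_txt, short_txt, short_weight=1.0):
    """Hypothetical combination of the two branches.

    Long captions are aligned with raw-image embeddings and short captions
    with masked-image embeddings, following the pairing in the abstract.
    """
    long_branch = info_nce(raw_img, long_txt)
    short_branch = info_nce(masked_img, short_txt)
    return long_branch + short_weight * short_branch


# Toy batch of three pairs with 3-dimensional embeddings.
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rotated = [eye[1], eye[2], eye[0]]  # every text is matched to the wrong image

aligned_loss = dual_branch_loss(eye, eye, eye, eye)
mismatched_loss = dual_branch_loss(eye, eye, rotated, rotated)
print(aligned_loss < mismatched_loss)
```

In this toy batch the correctly paired embeddings yield a much lower loss than the deliberately mismatched ones, which is the behaviour a contrastive objective is meant to reward. How the two branches are actually weighted in FIX-CLIP is not given in the abstract.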
