FIX-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text

Bingchao Wang; Zhiwei Ning; Jianyu Ding; Xuanang Gao; Yin Li; Dongsheng Jiang; Jie Yang; Wei Liu

arXiv:2507.10095 · cs.CV · July 30, 2025

TL;DR

FIX-CLIP introduces a dual-branch hierarchical contrastive learning framework with synthetic captions, significantly improving long-text understanding in CLIP models while maintaining short-text capabilities, and achieves state-of-the-art results on retrieval benchmarks.

Contribution

The paper proposes a novel dual-branch training pipeline, regional prompts, and hierarchical feature alignment to enhance long-text comprehension in CLIP, utilizing synthetic captions for large-scale training.

Findings

01

Achieves state-of-the-art performance on long-text retrieval benchmarks.

02

Enhances long-text representation without sacrificing short-text ability.

03

Demonstrates promising downstream application performance in diffusion models.

Abstract

CLIP has shown promising performance across many short-text tasks in a zero-shot manner. However, limited by the input length of its text encoder, CLIP struggles on downstream tasks with long-text inputs ($> 77$ tokens). To remedy this issue, we propose FIX-CLIP, which includes three novel modules: (1) A dual-branch training pipeline that aligns short and long texts with masked and raw images, respectively, which boosts the long-text representation while preserving the short-text ability. (2) Multiple learnable regional prompts with unidirectional masks in Transformer layers for regional information extraction. (3) A hierarchical feature alignment module in the intermediate encoder layers to promote the consistency of multi-scale features. Furthermore, we collect 30M images and utilize existing MLLMs to synthesize long-text captions for training. Extensive experiments show that…
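
To make the dual-branch idea in module (1) concrete, here is a minimal NumPy sketch of how two contrastive objectives might be combined: short captions paired with masked-image embeddings, long captions paired with raw-image embeddings. This is an illustration under stated assumptions, not the authors' implementation. The function names, the symmetric InfoNCE form of the loss, the temperature of 0.07, and the equal branch weights are all assumptions; the abstract does not specify them.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale each embedding to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text pairs.

    Row i of image_emb is assumed to match row i of text_emb.
    The temperature value is an assumption, not taken from the paper.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # (batch, batch) similarities
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


def dual_branch_loss(masked_img_emb, raw_img_emb, short_txt_emb, long_txt_emb,
                     short_weight=1.0, long_weight=1.0):
    """Hypothetical combination of the two branches described in the abstract.

    Short branch: short captions aligned with masked-image embeddings.
    Long branch:  long captions aligned with raw-image embeddings.
    The weighting scheme is an assumption for illustration only.
    """
    short_loss = clip_contrastive_loss(masked_img_emb, short_txt_emb)
    long_loss = clip_contrastive_loss(raw_img_emb, long_txt_emb)
    return short_weight * short_loss + long_weight * long_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, dim = 8, 32
    # Random stand-ins for encoder outputs; real embeddings would come
    # from the image and text encoders.
    masked_img = rng.normal(size=(batch, dim))
    raw_img = rng.normal(size=(batch, dim))
    short_txt = rng.normal(size=(batch, dim))
    long_txt = rng.normal(size=(batch, dim))
    print(dual_branch_loss(masked_img, raw_img, short_txt, long_txt))
```

In this sketch the two branches share nothing but the final sum, so each one can be weighted or ablated independently. How FIX-CLIP actually balances the branches, and how the image masking is performed, is described in the full paper rather than in the abstract.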
