MulCLIP: A Multi-level Alignment Framework for Enhancing Fine-grained Long-context CLIP
Chau Truong, Hieu Ta Quang, Dung D. Le

TL;DR
MulCLIP is a multi-level alignment framework that enhances vision-language models' ability to understand detailed long captions by combining global and fine-grained alignment strategies, improving performance on various benchmarks.
Contribution
The paper introduces MulCLIP, an end-to-end framework that extends CLIP with multi-scale alignment techniques for better handling long, detailed texts.
Findings
Improves downstream performance across diverse benchmarks.
Outperforms region-proposal-assisted approaches in fine-grained understanding.
Enhances semantic connections between words and image patches.
Abstract
Vision-language models like CLIP show impressive ability to align images and text, but their training on short, concise captions makes them struggle with lengthy, detailed descriptions. Recent advances mitigate this challenge by leveraging region-proposal information to map visual regions with corresponding sentences from lengthy captions, yet incurring notable deployment costs. We introduce MulCLIP, a novel end-to-end multi-level alignment framework that bridges natural long-text structures with image components. MulCLIP first preserves global contrastive alignment between images and both summary and long captions, while extending positional embeddings for longer text sequences. To further enhance fine-grained understanding, we propose two novel strategies: (1) a token reconstruction alignment over locally calibrated features to strengthen semantic connections between words and image…
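As a rough illustration of the global stage described in the abstract, the sketch below computes a CLIP-style symmetric contrastive loss between image embeddings and two sets of text embeddings: one for the long captions and one for their summaries. It is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the `summary_weight` parameter, the temperature value of 0.07, and the plain sum of the two terms are all hypothetical choices made for illustration; the abstract does not specify how the two terms are weighted.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_logits(images, texts, temperature):
    """Cosine similarity between every image and every text, divided by temperature."""
    images = [l2_normalize(v) for v in images]
    texts = [l2_normalize(t) for t in texts]
    return [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
        for v in images
    ]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_partition = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_partition - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(images, texts, temperature=0.07):
    """Average of the image-to-text and text-to-image contrastive losses."""
    logits = similarity_logits(images, texts, temperature)
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


def global_alignment_loss(images, long_texts, summary_texts,
                          summary_weight=1.0, temperature=0.07):
    """Assumed combination: long-caption term plus a weighted summary-caption term."""
    long_term = symmetric_contrastive_loss(images, long_texts, temperature)
    summary_term = symmetric_contrastive_loss(images, summary_texts, temperature)
    return long_term + summary_weight * summary_term


# Toy usage: two image embeddings with matching and mismatched caption embeddings.
image_embeddings = [[1.0, 0.0], [0.0, 1.0]]
matched_captions = [[1.0, 0.0], [0.0, 1.0]]
mismatched_captions = [[0.0, 1.0], [1.0, 0.0]]
print(global_alignment_loss(image_embeddings, matched_captions, matched_captions))
print(global_alignment_loss(image_embeddings, mismatched_captions, mismatched_captions))
```

The loss is close to zero when each image is most similar to its own long caption and summary, and large when the pairing is swapped. This is the behaviour a global contrastive objective is meant to encourage.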
Peer Reviews
Decision · Submitted to ICLR 2026
The paper is relatively easy to read, and the authors provide a good description of each term of the MulCLIP objective. The method generally performs well empirically, at least against the baselines (e.g., Tables 2, 3, and 4), and the authors have run a few ablations to confirm the role of each component of the objective; these present a mostly coherent story for the role of each term. The attention maps of MulCLIP, as shown in Figure 2, do seem sharper and more semanti
The write-up looks rushed and could be improved. For example, Table 1 is not referenced or explained in the text. In Equation 6, it is not clear what lowercase v tilde refers to; as far as I can tell, the text introduces only uppercase V tilde. Is this meant to be v'? Another issue is understanding the intuitive semantic difference between L_word and L_sub. Are these two losses trying to achieve the same thing semantically, just through different objectives? Are they intuitively/se
1. The self-supervised $\mathcal{L}_{Word}$ and $\mathcal{L}_{Sub}$ (SAP) losses serve as a replacement for the FILIP loss used by FineLIP's CLIM. This modification is remarkably concise and offers a novel solution to the challenge of token-level fine-grained alignment. 2. The experiments are thorough, covering both long and short-text retrieval, as well as zero-shot and in-domain retrieval. The ablation studies are well-designed, and the evaluation is further supplemented with image classification ben
1. The authors should more clearly articulate the specific improvements of the LC module over ATRM (from FineLIP, https://arxiv.org/pdf/2504.01916), or position it as an adoption of existing technology rather than a novel contribution. 2. While the idea of using an aggregated visual representation for the Subcaption-Aggregated Patch (SAP) loss is logical, the performance improvement appears incremental. 3. The formula on lines 174-176 is not numbered. Furthermore, although this formula is cited fr
1. The paper is well-written and easy to follow. 2. The proposed token reconstruction alignment and subcaption-aggregated patch alignment strategies are interesting and innovative. 3. Experimental results reflect the effectiveness of the proposed method to some extent.
1. The claim in Lines 124-126 of the paper is inappropriate. In fact, MulCLIP does not achieve finer granularity than FG-CLIP, which uses region-proposal assistance; the two methods merely differ in their approaches to fine-grained alignment. 2. The Subcaption-Aggregated Patch Alignment proposed in the paper is similar to [1], yet the paper lacks an explicit comparative discussion about this similarity. 3. FG-CLIP serves as an important baseline for the paper, but there is no comparison with it
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Handwritten Text Recognition Techniques
