SmartCLIP: Modular Vision-language Alignment with Identification Guarantees

Shaoan Xie; Lingjing Kong; Yujia Zheng; Yu Yao; Zeyu Tang; Eric P. Xing; Guangyi Chen; Kun Zhang

arXiv:2507.22264 · cs.CV · April 6, 2026

TL;DR

SmartCLIP is a modular vision-language alignment framework with theoretical identification guarantees. It addresses information misalignment between images and their captions and learns disentangled, fine-grained representations for multimodal understanding.

Contribution

The paper provides a theoretical foundation for flexible alignment between images and text at different granularities and proposes SmartCLIP, a modular approach built on that foundation. Code is publicly available.

Findings

01 Outperforms existing models on various tasks
02 Handles information misalignment effectively
03 Supports disentangled, fine-grained representations

Abstract

Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has emerged as a pivotal model in computer vision and multimodal learning, achieving state-of-the-art performance at aligning visual and textual representations through contrastive learning. However, CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representations. On the one hand, short captions for a single image in datasets like MSCOCO may describe disjoint regions in the image, leaving the model uncertain about which visual features to retain or disregard. On the other hand, directly aligning long captions with images can lead to the retention of entangled details, preventing the model from learning disentangled, atomic concepts, which ultimately limits its generalization on certain downstream tasks involving short prompts. In this paper, we…
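
For context, the contrastive alignment that the abstract attributes to CLIP can be sketched as a symmetric cross-entropy over image-text similarities. The snippet below illustrates only that standard baseline objective; it is not SmartCLIP's modular objective, which the truncated abstract above does not describe. The function names, the toy 2-D embeddings, and the fixed temperature of 0.07 (CLIP learns this value during training) are assumptions made for illustration.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image i, caption i).

    Hypothetical helper for illustration, not code from the SmartCLIP repo.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and caption j, scaled by 1/temperature.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each image should pick out its own caption.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch: two images and two captions in a made-up 2-D embedding space.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is close to zero when each caption embedding matches its own image and large when the captions are swapped. Because the objective pulls the whole caption embedding toward the whole image embedding, a caption that covers only part of an image still constrains the full image representation, which is the misalignment concern the abstract raises.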

Code & Models

Repositories

Mid-Push/SmartCLIP (GitHub)

Videos

SmartCLIP: Modular Vision-language Alignment with Identification Guarantees (SlidesLive)