TL;DR
SmartCLIP introduces a modular vision-language alignment framework with theoretical guarantees, effectively handling information misalignment and enabling disentangled, fine-grained representations for improved multimodal understanding.
Contribution
The paper provides a theoretical foundation for flexible alignment at different granularities and proposes SmartCLIP, a modular approach whose effectiveness is demonstrated empirically, with code available.
Findings
Outperforms existing models on various tasks
Handles information misalignment effectively
Supports disentangled, fine-grained representations
Abstract
Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has emerged as a pivotal model in computer vision and multimodal learning, achieving state-of-the-art performance at aligning visual and textual representations through contrastive learning. However, CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representations. On the one hand, short captions for a single image in datasets like MSCOCO may describe disjoint regions in the image, leaving the model uncertain about which visual features to retain or disregard. On the other hand, directly aligning long captions with images can lead to the retention of entangled details, preventing the model from learning disentangled, atomic concepts, which ultimately limits its generalization on certain downstream tasks involving short prompts. In this paper, we…
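For readers unfamiliar with the contrastive alignment the abstract refers to, the following is a minimal sketch of a standard CLIP-style symmetric contrastive loss. It illustrates the baseline objective only, not SmartCLIP's method, which the truncated abstract does not describe. The function names, the toy embeddings, and the fixed temperature of 0.07 are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_diagonal_cross_entropy(logits):
    """Mean over rows of -log softmax(row)[i], treating entry i as the match."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to come from the same pair.
    The temperature value is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    image_to_text = mean_diagonal_cross_entropy(logits)
    text_to_image = mean_diagonal_cross_entropy(logits_t)
    return 0.5 * (image_to_text + text_to_image)


# Toy batch: two pairs with orthogonal embeddings (hypothetical data).
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = clip_style_loss(images, matched_texts)
loss_swapped = clip_style_loss(images, swapped_texts)
```

In this toy batch the correctly paired captions yield a much lower loss than the swapped ones. The objective pulls each image toward its entire caption, which is where the misalignment described in the abstract can arise when a caption covers only part of the image.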
