SLIP: Structural-aware Language-Image Pretraining for Vision-Language Alignment
Wenbo Lu

TL;DR
SLIP enhances vision-language pretraining by incorporating relational structure through a graph-based approach, leading to improved cross-modal retrieval and classification performance in zero-shot and few-shot scenarios.
Contribution
The paper introduces a structure-aware pretraining method that models relationships between neighboring entities in a graph, supported by a large-scale Amazon product co-purchase multimodal graph dataset.
Findings
SLIP outperforms CLIP on cross-modal retrieval and classification.
Relational supervision from the graph improves alignment between modalities.
The gains hold in both zero-shot and few-shot settings.
Abstract
Vision-Language Pretraining (VLP) has achieved remarkable success across various downstream tasks, but these gains are largely driven by scaling up training data. Existing methods treat image-text pairs as isolated training examples, neglecting the rich relational structure naturally present in many domains, such as e-commerce product co-purchase graphs and social recommendation networks. Inspired by neuroscientific evidence that humans encode knowledge as relational cognitive maps, we introduce Structure-aware Language-Image Pretraining (SLIP). SLIP integrates a structural contrastive loss that aligns modalities while also modeling relationships between neighboring entities in a structured graph. To support this paradigm, we construct a large-scale Amazon Product Co-purchase Multimodal Graph Dataset, enabling structured cross-modality supervision at scale. Experimental results…
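The abstract does not spell out the structural contrastive loss, so the following is only an illustrative sketch of one plausible form, not the authors' formulation. It assumes a CLIP-style symmetric InfoNCE objective in which each item's positives are its own image-text pair plus the pairs of its graph neighbors (a multi-positive contrastive loss). The function name `structural_contrastive_loss`, the dense 0/1 `adjacency` matrix, and the default temperature of 0.07 are all hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def structural_contrastive_loss(img_emb, txt_emb, adjacency, temperature=0.07):
    """Hypothetical multi-positive InfoNCE loss over a batch of graph nodes.

    img_emb, txt_emb: lists of n embedding vectors (one image and one text per node).
    adjacency: n x n matrix of 0/1 entries; adjacency[i][j] == 1 marks j as a neighbor of i.
    With an all-zero adjacency this reduces to the usual symmetric CLIP-style loss.
    """
    n = len(img_emb)
    img = [l2_normalize(v) for v in img_emb]
    txt = [l2_normalize(v) for v in txt_emb]

    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Positives for node i: its own pair (diagonal) plus its graph neighbors.
    pos = [
        [1 if (i == j or adjacency[i][j]) else 0 for j in range(n)]
        for i in range(n)
    ]

    def one_direction(lg, mask):
        # Average negative log-probability assigned to each row's positives.
        total = 0.0
        for i in range(n):
            log_probs = log_softmax(lg[i])
            num_pos = sum(mask[i])  # at least 1 because of the diagonal
            total -= sum(log_probs[j] for j in range(n) if mask[i][j]) / num_pos
        return total / n

    logits_t = [list(col) for col in zip(*logits)]
    pos_t = [list(col) for col in zip(*pos)]

    # Symmetric loss: image-to-text plus text-to-image.
    return 0.5 * (one_direction(logits, pos) + one_direction(logits_t, pos_t))
```

Under this sketch, linking two nodes in `adjacency` penalizes embeddings that keep those nodes far apart, so the loss falls as co-purchased items move closer together in the shared space. A real implementation would use batched tensor operations and sampled neighborhoods rather than a dense adjacency matrix.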
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks
