SLIP: Structural-aware Language-Image Pretraining for Vision-Language Alignment

Wenbo Lu

arXiv:2511.03019·cs.CV·November 6, 2025

TL;DR

SLIP enhances vision-language pretraining by incorporating relational structure through a graph-based approach, leading to improved cross-modal retrieval and classification performance in zero-shot and few-shot scenarios.

Contribution

The paper introduces a structure-aware pretraining method that models relationships between entities, supported by a large-scale multimodal graph dataset.

Findings

01

SLIP outperforms CLIP on cross-modal tasks.

02

Relational supervision improves alignment accuracy.

03

Effective in zero-shot and few-shot settings.

Abstract

Vision-Language Pretraining (VLP) has achieved remarkable success across various downstream tasks, but these gains are largely driven by scaling up training data. Existing methods, however, treat image-text pairs as isolated training examples, neglecting the rich relational structure naturally present in many domains, such as e-commerce product co-purchase graphs and social recommendation networks. Inspired by neuroscientific evidence that humans encode knowledge as relational cognitive maps, we introduce Structure-aware Language-Image Pretraining (SLIP). SLIP integrates a structural contrastive loss to align modalities while also modeling relationships between neighboring entities in a structured graph. To support this paradigm, we construct a large-scale Amazon Product Co-purchase Multimodal Graph Dataset, enabling structured cross-modality supervision at scale. Experiment results…
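
The abstract does not spell out the structural contrastive loss, so the following is only a minimal sketch of one way such a loss could look, not the paper's formulation. It assumes a CLIP-style symmetric InfoNCE objective in which each item's positives are its own caption plus the captions of its graph neighbors. The function names, the symmetric 0/1 adjacency matrix `adj`, and the temperature of 0.07 are all illustrative assumptions.

```python
import math


def _normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _log_softmax(logits):
    # Numerically stable log-softmax over one row of similarity scores.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def _one_way_loss(queries, keys, adj, temperature):
    # For query i, the positives are key i and the keys of i's graph neighbors.
    n = len(queries)
    total = 0.0
    for i in range(n):
        logits = [
            sum(a * b for a, b in zip(queries[i], keys[j])) / temperature
            for j in range(n)
        ]
        log_probs = _log_softmax(logits)
        positives = [j for j in range(n) if j == i or adj[i][j]]
        total -= sum(log_probs[j] for j in positives) / len(positives)
    return total / n


def structural_contrastive_loss(image_emb, text_emb, adj, temperature=0.07):
    # Hypothetical helper: averages image-to-text and text-to-image losses.
    # Assumes adj is a symmetric 0/1 matrix (undirected co-purchase edges).
    img = [_normalize(v) for v in image_emb]
    txt = [_normalize(v) for v in text_emb]
    return 0.5 * (
        _one_way_loss(img, txt, adj, temperature)
        + _one_way_loss(txt, img, adj, temperature)
    )


# Toy batch of three products; products 0 and 1 are co-purchased.
image_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text_emb = [[1.0, 0.1], [0.8, 0.2], [0.1, 1.0]]
no_edges = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
co_purchase = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]

plain_loss = structural_contrastive_loss(image_emb, text_emb, no_edges)
structural_loss = structural_contrastive_loss(image_emb, text_emb, co_purchase)
```

With an all-zero adjacency matrix the sketch reduces to the usual pairwise contrastive objective, which is the "isolated training examples" setting the abstract contrasts against. Adding edges turns neighboring items into additional positives instead of negatives.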

Code & Models

Datasets

wenboluu/ACMMG
dataset · 19 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks