PinCLIP: Large-scale Foundational Multimodal Representation at Pinterest

Josh Beal; Eric Kim; Jinfeng Rao; Rex Wu; Dmitry Kislyuk; Charles Rosenberg

arXiv:2603.03544·cs.CV·March 5, 2026

TL;DR

PinCLIP is a large-scale multimodal visual language model designed for Pinterest that improves retrieval, ranking, and cold-start content recommendation by leveraging a novel hybrid architecture and cross-modal alignment objectives, leading to significant business gains.

Contribution

Introduces PinCLIP, a hybrid Vision Transformer architecture with a neighbor alignment objective, enhancing multimodal content understanding and recommendation at Pinterest.

Findings

01. Outperforms state-of-the-art baselines by 20% in retrieval tasks

02. Achieves 15% increase in organic Repins for new content

03. Results in 8.7% higher click-through rate for new ads

Abstract

While multi-modal Visual Language Models (VLMs) have demonstrated significant success across various domains, the integration of VLMs into recommendation and retrieval systems remains a challenge, due to issues like training objective discrepancies and serving efficiency bottlenecks. This paper introduces PinCLIP, a large-scale visual representation learning approach developed to enhance retrieval and ranking models at Pinterest by leveraging VLMs to learn image-text alignment. We propose a novel hybrid Vision Transformer architecture that utilizes a VLM backbone and a hybrid fusion mechanism to capture multi-modality content representation at varying granularities. Beyond standard image-to-text alignment objectives, we introduce a neighbor alignment objective to model the cross-fusion of multi-modal representations within the Pinterest Pin-Board graph. Offline evaluations show that…
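
To make the two alignment objectives named in the abstract concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE, CLIP-style) loss over a batch of paired embeddings. It is an illustration under stated assumptions, not the paper's implementation: the function names, the temperature of 0.07, the weight `neighbor_weight`, and the choice to reuse the same contrastive form for the neighbor alignment term are all hypothetical, since the visible abstract does not specify how either objective is formulated.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss: row i of `a` is the positive for row i of `b`.

    All other rows in the batch act as negatives. The temperature value is
    an assumption, not taken from the paper.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(len(a))                   # positives sit on the diagonal
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)


# Toy batch with random stand-ins for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 16))                       # Pin image embeddings
text_emb = image_emb + 0.1 * rng.normal(size=(8, 16))      # paired text embeddings
neighbor_emb = image_emb + 0.3 * rng.normal(size=(8, 16))  # Pins from the same Board

neighbor_weight = 0.5  # hypothetical weighting between the two terms
total_loss = (
    symmetric_contrastive_loss(image_emb, text_emb)
    + neighbor_weight * symmetric_contrastive_loss(image_emb, neighbor_emb)
)
print(f"total loss: {total_loss:.4f}")
```

In this sketch the neighbor term simply treats a Pin sharing a Board as an extra positive pair; how the paper samples neighbors from the Pin-Board graph, and how it weights the terms, is not stated in the text above.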

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis