PinCLIP: Large-scale Foundational Multimodal Representation at Pinterest
Josh Beal, Eric Kim, Jinfeng Rao, Rex Wu, Dmitry Kislyuk, Charles Rosenberg

TL;DR
PinCLIP is a large-scale multimodal representation model for Pinterest, built on a visual language model (VLM) backbone. It improves retrieval, ranking, and cold-start content recommendation through a hybrid architecture and cross-modal alignment objectives, with reported gains in organic Repins and ad click-through rate for new content.
Contribution
Introduces PinCLIP, a hybrid Vision Transformer architecture with a neighbor alignment objective, enhancing multimodal content understanding and recommendation at Pinterest.
Findings
Outperforms state-of-the-art baselines by 20% in retrieval tasks
Achieves a 15% increase in organic Repins for new content
Yields an 8.7% higher click-through rate for new ads
Abstract
While multi-modal Visual Language Models (VLMs) have demonstrated significant success across various domains, the integration of VLMs into recommendation and retrieval systems remains a challenge, due to issues like training objective discrepancies and serving efficiency bottlenecks. This paper introduces PinCLIP, a large-scale visual representation learning approach developed to enhance retrieval and ranking models at Pinterest by leveraging VLMs to learn image-text alignment. We propose a novel hybrid Vision Transformer architecture that utilizes a VLM backbone and a hybrid fusion mechanism to capture multi-modality content representation at varying granularities. Beyond standard image-to-text alignment objectives, we introduce a neighbor alignment objective to model the cross-fusion of multi-modal representations within the Pinterest Pin-Board graph. Offline evaluations show that…
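The abstract names two training signals: image-to-text alignment and a neighbor alignment objective over the Pin-Board graph. The excerpt does not give the exact formulation, so the following is only a minimal sketch under stated assumptions: both terms are modeled as a standard CLIP-style symmetric contrastive (InfoNCE) loss, a Pin's "neighbor" is assumed to be another Pin from the same Board, and the function names, the temperature of 0.07, and `neighbor_weight` are hypothetical choices for illustration, not values from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy where queries[i] should match keys[i];
    every other key in the batch serves as a negative."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def symmetric_alignment_loss(a, b, temperature=0.07):
    """CLIP-style loss: average of the a-to-b and b-to-a directions."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def combined_loss(image_emb, text_emb, pin_emb, neighbor_emb, neighbor_weight=1.0):
    """Image-text alignment plus a neighbor term (assumed form, see lead-in).

    pin_emb[i] and neighbor_emb[i] are assumed to be fused embeddings of two
    Pins that share a Board; neighbor_weight is a hypothetical mixing factor.
    """
    image_text = symmetric_alignment_loss(image_emb, text_emb)
    neighbor = symmetric_alignment_loss(pin_emb, neighbor_emb)
    return image_text + neighbor_weight * neighbor
```

With toy two-dimensional embeddings, correctly paired inputs such as `[[1.0, 0.0], [0.0, 1.0]]` matched against themselves give a loss near zero, while swapping the order of the keys gives a much larger loss. That ordering is the behavior a contrastive alignment objective is meant to enforce.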
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis
