PixRec: Leveraging Visual Context for Next-Item Prediction in Sequential Recommendation
Sayak Chakrabarty, Souradip Pal

TL;DR
PixRec introduces a vision-language framework that integrates product images and textual attributes to significantly improve sequential recommendation accuracy in e-commerce, demonstrating the value of visual information.
Contribution
This work presents a novel multi-modal recommendation architecture that jointly processes image and text data, enhancing item differentiation beyond text-only models.
Findings
3x improvement in top-rank accuracy
40% improvement in top-10 accuracy
Effective integration of visual features in recommendation systems
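The findings above are stated as top-rank and top-10 accuracy. A minimal sketch of how such metrics are commonly computed follows, assuming they correspond to Hit@1 and Hit@10 over ranked next-item predictions (the page does not confirm the exact metric definition). The function name `hit_at_k` and the toy data are hypothetical and are not taken from the paper.

```python
def hit_at_k(ranked_lists, targets, k):
    """Fraction of sequences whose true next item appears in the top-k predictions.

    ranked_lists: one list of item ids per user sequence, best prediction first.
    targets: the true next item id for each sequence.
    """
    if len(ranked_lists) != len(targets):
        raise ValueError("ranked_lists and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)


# Hypothetical toy predictions for four user sequences (not data from the paper).
ranked = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "b", "a"],
    ["a", "c", "b"],
]
targets = ["a", "a", "a", "b"]

hit1 = hit_at_k(ranked, targets, 1)  # only the first sequence ranks its target first
hit2 = hit_at_k(ranked, targets, 2)  # the first two sequences have the target in the top 2
hit3 = hit_at_k(ranked, targets, 3)  # every target appears somewhere in the top 3
```

Under this reading, a "3x improvement in top-rank accuracy" would mean Hit@1 for the multi-modal model is three times that of the text-only baseline on the same test sequences.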
Abstract
Large Language Models (LLMs) have recently shown strong potential for sequential recommendation through text-only models that combine advanced prompt design, contrastive alignment, and fine-tuning on downstream domain-specific data. While effective, these approaches overlook the rich visual information present in many real-world recommendation scenarios, particularly in e-commerce. This paper proposes PixRec, a vision-language framework that incorporates both textual attributes and product images into the recommendation pipeline. Our architecture leverages a vision-language model backbone capable of jointly processing image-text sequences, maintaining a dual-tower structure and a mixed training objective while aligning multi-modal feature projections for both item-item and user-item interactions. Using the Amazon Reviews dataset augmented with product images, our…
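The abstract mentions a dual-tower structure with a mixed objective covering user-item and item-item alignment, but it does not give the loss. The sketch below illustrates one common way such an objective can be written: an in-batch contrastive (InfoNCE-style) loss applied to each pair type and combined with a weight. This is an assumption for illustration, not the authors' formulation; `info_nce`, `mixed_objective`, `alpha`, `temperature`, and the toy embeddings are all hypothetical, and the embeddings stand in for whatever the two towers would output.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.1):
    """In-batch contrastive loss: queries[i] should score highest against keys[i]."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of this query to every key in the batch.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def mixed_objective(user_vecs, next_item_vecs, item_vecs, paired_item_vecs,
                    alpha=0.5, temperature=0.1):
    """Weighted sum of a user-item loss and an item-item loss (assumed form)."""
    user_item = info_nce(user_vecs, next_item_vecs, temperature)
    item_item = info_nce(item_vecs, paired_item_vecs, temperature)
    return alpha * user_item + (1.0 - alpha) * item_item


# Hypothetical 2-d embeddings standing in for tower outputs.
users = [[1.0, 0.0], [0.0, 1.0]]
items = [[1.0, 0.0], [0.0, 1.0]]

aligned = mixed_objective(users, items, items, items)
shuffled = mixed_objective(users, items[::-1], items, items)
```

In this toy setup, `aligned` is close to zero because each user embedding matches its own next item, while `shuffled` is much larger because the user-item pairs are mismatched. In a real system the vectors would come from the vision-language backbone's projections of image-text sequences.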
Taxonomy
Topics: Multimodal Machine Learning Applications · Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI)
