VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings

Ramin Giahi; Kehui Yao; Sriram Kollipara; Kai Zhao; Vahid Mirjalili; Jianpeng Xu; Topojoy Biswas; Evren Korpeoglu; Kannan Achan

arXiv:2507.17080 · cs.IR · July 24, 2025

TL;DR

VL-CLIP enhances multimodal e-commerce recommendations by integrating visual grounding for fine-grained image understanding and LLM-augmented text embeddings, significantly improving retrieval accuracy and user engagement metrics.

Contribution

The paper introduces VL-CLIP, a novel framework that combines visual grounding and LLM-based text enhancement to address limitations of existing vision-language models in e-commerce.

Findings

1. Increased CTR by 18.6%

2. Improved retrieval precision over baseline models

3. Outperformed existing models like FashionCLIP and GCL

Abstract

Multimodal learning plays a critical role in e-commerce recommendation platforms today, enabling accurate recommendations and product understanding. However, existing vision-language models, such as CLIP, face key challenges in e-commerce recommendation systems: 1) Weak object-level alignment, where global image embeddings fail to capture fine-grained product attributes, leading to suboptimal retrieval performance; 2) Ambiguous textual representations, where product descriptions often lack contextual clarity, affecting cross-modal matching; and 3) Domain mismatch, as generic vision-language models may not generalize well to e-commerce-specific data. To address these limitations, we propose a framework, VL-CLIP, that enhances CLIP embeddings by integrating Visual Grounding for fine-grained visual understanding and an LLM-based agent for generating enriched text embeddings. Visual…
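
The cross-modal matching that the abstract refers to typically comes down to ranking catalog items by the similarity between a query embedding and item embeddings. The following is a minimal sketch of that ranking step only, assuming cosine similarity as the scoring function. The function names (`cosine`, `retrieve`), the item ids, and the three-dimensional toy vectors are all hypothetical illustrations and are not taken from the paper. It does not implement visual grounding, the LLM agent, or CLIP itself.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, catalog, k=2):
    """Return the ids of the k catalog items most similar to the query."""
    scored = [(cosine(query_vec, vec), item_id) for item_id, vec in catalog.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]

# Made-up item embeddings standing in for image embeddings (assumption).
catalog = {
    "red_dress": [0.9, 0.1, 0.0],
    "blue_jeans": [0.1, 0.9, 0.1],
    "red_scarf": [0.7, 0.0, 0.6],
}

# Made-up query embedding standing in for a text embedding (assumption).
query = [1.0, 0.0, 0.1]

top_items = retrieve(query, catalog, k=2)
print(top_items)
```

In a real system the vectors would come from trained image and text encoders, and the exhaustive sort would usually be replaced by an approximate nearest-neighbor index.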

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling