Beyond Text: Aligning Vision and Language for Multimodal E-Commerce Retrieval
Qujiaheng Zhang, Guagnyue Xu, Fengjie Li

TL;DR
This paper introduces a modality fusion network for two-tower e-commerce retrieval that combines product text and images, improving search relevance by using the visual signals in product images that text-only systems leave underused.
Contribution
It proposes a new modality fusion network and highlights the importance of domain-specific fine-tuning and two-stage alignment for multimodal retrieval.
Findings
Enhanced retrieval accuracy on large-scale datasets
Effective fusion of text and image modalities
Importance of domain-specific fine-tuning
Abstract
Modern e-commerce search is inherently multimodal: customers make purchase decisions by jointly considering product text and visual information. However, most industrial retrieval and ranking systems rely primarily on textual information, underutilizing the rich visual signals available in product images. In this work, we study unified text-image fusion for two-tower retrieval models in the e-commerce domain. We demonstrate that domain-specific fine-tuning and two-stage alignment of the query with the product text and image modalities are both crucial for effective multimodal retrieval. Building on these insights, we propose a novel modality fusion network that fuses image and text information and captures cross-modal complementary information. Experiments on large-scale e-commerce datasets validate the effectiveness of the proposed approach.
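The abstract does not specify the fusion architecture, so the following is only a minimal sketch of the general idea, assuming a two-tower setup in which the product tower blends a text embedding and an image embedding through a scalar sigmoid gate before cosine scoring against the query. The gate is a stand-in and not the authors' fusion network; the function names (`gated_fusion`, `rank_products`) and the toy embeddings are hypothetical, and plain Python lists are used so the sketch runs without dependencies.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def gated_fusion(text_emb, image_emb, gate_weights, gate_bias=0.0):
    """Blend text and image embeddings with a scalar gate (assumed design).

    gate  = sigmoid(w . [text; image] + b)
    fused = gate * text + (1 - gate) * image
    """
    concat = list(text_emb) + list(image_emb)
    logit = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-logit))
    fused = [gate * t + (1.0 - gate) * i for t, i in zip(text_emb, image_emb)]
    return l2_normalize(fused), gate


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_products(query_emb, products, gate_weights):
    """Score each (id, text_emb, image_emb) product against the query tower output."""
    scored = []
    for product_id, text_emb, image_emb in products:
        fused, _ = gated_fusion(text_emb, image_emb, gate_weights)
        scored.append((product_id, cosine(query_emb, fused)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: 2-dimensional embeddings; zero gate weights give gate = 0.5.
query = [1.0, 0.0]
products = [
    ("A", [1.0, 0.0], [1.0, 0.0]),
    ("B", [0.0, 1.0], [0.0, 1.0]),
]
ranking = rank_products(query, products, gate_weights=[0.0, 0.0, 0.0, 0.0])
```

In a trained system the gate weights, like the towers themselves, would be learned, for example with a contrastive objective over query-product pairs. The fixed weights here only illustrate the data flow from separate text and image embeddings to a single product vector.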
Taxonomy
Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
