Beyond Text: Aligning Vision and Language for Multimodal E-Commerce Retrieval

Qujiaheng Zhang; Guagnyue Xu; Fengjie Li

arXiv:2603.04836·cs.IR·March 6, 2026

Open Access

TL;DR

This paper introduces a modality fusion network for two-tower e-commerce retrieval that combines product text and images, improving search relevance by using the visual signals in product images that text-only systems underuse.

Contribution

It proposes a new modality fusion network and highlights the importance of domain-specific fine-tuning and two-stage alignment for multimodal retrieval.

Findings

01 Enhanced retrieval accuracy on large-scale datasets

02 Effective fusion of text and image modalities

03 Importance of domain-specific fine-tuning

Abstract

Modern e-commerce search is inherently multimodal: customers make purchase decisions by jointly considering product text and visual information. However, most industrial retrieval and ranking systems rely primarily on textual information, underutilizing the rich visual signals available in product images. In this work, we study unified text-image fusion for two-tower retrieval models in the e-commerce domain. We demonstrate that domain-specific fine-tuning and two-stage alignment between the query and the product text and image modalities are both crucial for effective multimodal retrieval. Building on these insights, we propose a novel modality fusion network that fuses image and text information and captures cross-modal complementary information. Experiments on large-scale e-commerce datasets validate the effectiveness of the proposed approach.
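To make the two-tower setup concrete, here is a minimal sketch of how a product tower might fuse text and image embeddings before scoring against a query embedding. The abstract does not specify the fusion network, so the scalar gate used here is an assumption chosen for illustration, not the authors' method. The function names (`fuse`, `rank`), the gate weights, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse(text_emb, image_emb, gate_weights, gate_bias=0.0):
    """Hypothetical gated fusion (an assumption, not the paper's network).

    A scalar gate in (0, 1) is computed from the concatenated embeddings
    and used to mix the text and image vectors element-wise.
    """
    gate = sigmoid(dot(gate_weights, text_emb + image_emb) + gate_bias)
    mixed = [gate * t + (1.0 - gate) * v for t, v in zip(text_emb, image_emb)]
    return l2_normalize(mixed)


def rank(query_emb, product_embs):
    """Return product ids sorted by cosine similarity to the query, best first."""
    q = l2_normalize(query_emb)
    scores = {pid: dot(q, emb) for pid, emb in product_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-d embeddings standing in for the outputs of real text and image encoders.
weights = [0.0, 0.0, 0.0, 0.0]  # zero weights give a gate of exactly 0.5
products = {
    "red_dress": fuse([1.0, 0.0], [0.9, 0.1], weights),
    "blue_shoe": fuse([0.0, 1.0], [0.1, 0.9], weights),
}
ranking = rank([1.0, 0.2], products)
print(ranking)
```

In a real system the gate weights would be learned, and product embeddings would typically be precomputed and indexed for nearest-neighbor search, since the two-tower design keeps the query and product sides independent until scoring.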

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques