Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation

Yufei Guo; Jing Ma; Tianlu Zhang; Shijie Yang; Yanlong Zang; Weijie Ding; Pinghua Gong; Jungong Han

arXiv:2605.17366 · cs.IR · May 19, 2026

TL;DR

This paper introduces TGQ-Former, a text-guided visual representation learning framework that enhances robustness in multimodal e-commerce retrieval by effectively handling noisy product images.

Contribution

The paper proposes a hybrid-query connector and a reliability-aware dual-gated module that use structured metadata to guide visual token extraction.

Findings

01. TGQ-Former outperforms baselines on large-scale datasets.

02. It achieves a 6.04% improvement in Hit Rate@100.

03. It improves robustness against promotional overlays and background clutter.
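
For readers unfamiliar with the metric, Hit Rate@K is commonly computed as the fraction of queries for which at least one relevant item appears among the top K retrieved items. The sketch below follows that common definition. The paper's exact evaluation protocol is not given in this summary, so the function name, argument layout, and toy data here are illustrative assumptions rather than the authors' code.

```python
from typing import Sequence, Set


def hit_rate_at_k(
    ranked: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    k: int = 100,
) -> float:
    """Fraction of queries with at least one relevant item in the top-k results.

    ranked[i] is the retrieved item list for query i, best match first.
    relevant[i] is the set of ground-truth items for query i.
    """
    if len(ranked) != len(relevant):
        raise ValueError("ranked and relevant must have the same number of queries")
    if not ranked:
        return 0.0
    hits = sum(
        1
        for items, truth in zip(ranked, relevant)
        if any(item in truth for item in items[:k])
    )
    return hits / len(ranked)


# Toy data (hypothetical): two queries, three retrieved items each.
ranked = [["a", "b", "c"], ["d", "e", "f"]]
relevant = [{"c"}, {"x"}]

print(hit_rate_at_k(ranked, relevant, k=2))  # 0.0: "c" is ranked third, outside the top 2
print(hit_rate_at_k(ranked, relevant, k=3))  # 0.5: query 1 hits, query 2 never does
```

The summary does not say whether the reported 6.04% is a relative or an absolute gain, so a comparison against a baseline would need that detail from the full paper.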

Abstract

Multimodal item embeddings are crucial for e-commerce item-to-item (I2I) retrieval, yet real-world product images often contain promotional overlays and background clutter that inject spurious visual cues and degrade retrieval robustness. This issue is particularly pronounced in MLRM-style pipelines, where a frozen vision encoder is connected to an LLM through a lightweight connector that must selectively aggregate visual tokens. We propose Text-Guided Q-Former (TGQ-Former), a text-guided visual representation learning framework that leverages structured metadata as semantic guidance for visual token extraction while preserving complementary visual evidence. Concretely, TGQ-Former employs a hybrid-query connector to disentangle metadata-anchored and exploratory visual streams, and introduces a lightweight reliability-aware dual-gated vector modulation module to adaptively calibrate…
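
To make the hybrid-query idea concrete, here is a minimal NumPy sketch of one way two query streams could attend over frozen visual tokens: one stream whose queries come from metadata text embeddings, and one whose queries are free learnable vectors. Everything in it is an assumption for illustration: the single-head attention, the shapes, the random stand-in embeddings, and the names `cross_attend` and `hybrid_query_connector` are hypothetical and are not taken from the paper. The dual-gated modulation module is left out because the abstract text available here is truncated before describing it.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries: np.ndarray, visual_tokens: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product cross-attention.

    queries: (num_queries, dim); visual_tokens: (num_tokens, dim).
    Returns one aggregated visual vector per query, shape (num_queries, dim).
    """
    dim = queries.shape[-1]
    scores = queries @ visual_tokens.T / np.sqrt(dim)
    weights = softmax(scores, axis=-1)  # each row sums to 1 over visual tokens
    return weights @ visual_tokens


def hybrid_query_connector(
    visual_tokens: np.ndarray,
    text_queries: np.ndarray,
    learned_queries: np.ndarray,
) -> np.ndarray:
    """Hypothetical connector: keep the two streams separate, then concatenate.

    text_queries stand in for metadata-derived queries (anchored stream);
    learned_queries stand in for free parameters (exploratory stream).
    """
    anchored = cross_attend(text_queries, visual_tokens)
    exploratory = cross_attend(learned_queries, visual_tokens)
    return np.concatenate([anchored, exploratory], axis=0)


# Random stand-ins for real encoder outputs (assumed sizes).
rng = np.random.default_rng(0)
visual_tokens = rng.normal(size=(196, 64))   # e.g. patch tokens from a frozen encoder
text_queries = rng.normal(size=(8, 64))      # metadata-derived queries
learned_queries = rng.normal(size=(8, 64))   # learnable queries

out = hybrid_query_connector(visual_tokens, text_queries, learned_queries)
print(out.shape)  # (16, 64)
```

Keeping the streams in separate attention calls means the text-derived queries cannot influence what the exploratory queries attend to, which is one simple reading of "disentangle"; the paper may realize this differently.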
