Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation
Yufei Guo, Jing Ma, Tianlu Zhang, Shijie Yang, Yanlong Zang, Weijie Ding, Pinghua Gong, Jungong Han

TL;DR
This paper introduces TGQ-Former, a text-guided visual representation learning framework that enhances robustness in multimodal e-commerce retrieval by effectively handling noisy product images.
Contribution
It proposes a novel hybrid-query connector and a reliability-aware dual-gated module for improved visual token extraction guided by structured metadata.
Findings
TGQ-Former outperforms baselines on large-scale datasets.
It achieves a 6.04% improvement in Hit Rate@100.
It is more robust to promotional overlays and background clutter in product images.
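For readers unfamiliar with the metric, Hit Rate@K is commonly computed as the fraction of queries for which at least one relevant item appears among the top K retrieved items. The sketch below follows that common definition; the paper's exact evaluation protocol is not given in this summary, so the function name `hit_rate_at_k`, the dictionary-based inputs, and the handling of queries without relevant items are all assumptions made for illustration.

```python
from typing import Dict, Hashable, Sequence, Set


def hit_rate_at_k(
    retrieved: Dict[Hashable, Sequence[Hashable]],
    relevant: Dict[Hashable, Set[Hashable]],
    k: int = 100,
) -> float:
    """Fraction of queries with at least one relevant item in the top-k list.

    `retrieved` maps each query item to its ranked list of retrieved items.
    `relevant` maps each query item to its set of ground-truth related items.
    Queries with no ground-truth items are skipped (an assumption here).
    """
    if k <= 0:
        raise ValueError("k must be positive")

    # Only score queries that have at least one ground-truth item.
    queries = [q for q in retrieved if relevant.get(q)]
    if not queries:
        return 0.0

    hits = sum(
        1
        for q in queries
        if any(item in relevant[q] for item in retrieved[q][:k])
    )
    return hits / len(queries)


# Toy usage: query "a" finds its relevant item at rank 3, query "b" never does.
toy_retrieved = {"a": ["x", "y", "z"], "b": ["p", "q", "r"]}
toy_relevant = {"a": {"z"}, "b": {"s"}}
print(hit_rate_at_k(toy_retrieved, toy_relevant, k=3))
```

Some papers report a recall-style variant that counts every relevant item rather than one hit per query, so a reported Hit Rate@100 is only comparable across papers that share the same definition.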
Abstract
Multimodal item embeddings are crucial for e-commerce item-to-item (I2I) retrieval, yet real-world product images often contain promotional overlays and background clutter that inject spurious visual cues and degrade retrieval robustness. This issue is particularly pronounced in MLRM-style pipelines, where a frozen vision encoder is connected to an LLM through a lightweight connector that must selectively aggregate visual tokens. We propose Text-Guided Q-Former (TGQ-Former), a text-guided visual representation learning framework that leverages structured metadata as semantic guidance for visual token extraction while preserving complementary visual evidence. Concretely, TGQ-Former employs a hybrid-query connector to disentangle metadata-anchored and exploratory visual streams, and introduces a lightweight reliability-aware dual-gated vector modulation module to adaptively calibrate…
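One way to picture the hybrid-query idea is as two groups of queries that cross-attend over the same frozen visual tokens: one group derived from metadata text, the other free of text guidance. The sketch below is only an illustration under strong assumptions and is not the authors' implementation: it uses single-head scaled dot-product attention with no learned projections, and the names `cross_attend`, `hybrid_query_extract`, `text_queries`, and `free_queries` are hypothetical. It covers only the two query streams, not the gating module.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], visual_tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention.

    Assumption: keys and values are the visual tokens themselves, with no
    learned projection matrices, to keep the sketch small.
    """
    dim = len(visual_tokens[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * t for q, t in zip(query, token)) / math.sqrt(dim)
            for token in visual_tokens
        ]
        weights = softmax(scores)
        # Weighted average of the visual tokens for this query.
        outputs.append(
            [sum(w * token[d] for w, token in zip(weights, visual_tokens))
             for d in range(dim)]
        )
    return outputs


def hybrid_query_extract(
    text_queries: List[Vector],
    free_queries: List[Vector],
    visual_tokens: List[Vector],
) -> Tuple[List[Vector], List[Vector]]:
    """Return (metadata-anchored outputs, exploratory outputs) as separate streams."""
    anchored = cross_attend(text_queries, visual_tokens)
    exploratory = cross_attend(free_queries, visual_tokens)
    return anchored, exploratory


# Toy usage: two visual tokens; the text query strongly matches the first one.
tokens = [[1.0, 0.0], [0.0, 1.0]]
anchored, exploratory = hybrid_query_extract([[10.0, 0.0]], [[0.0, 0.0]], tokens)
print(anchored[0], exploratory[0])
```

In this toy setup the text-derived query concentrates almost all attention on the matching visual token, while the zero query spreads attention evenly, which is the intuition behind keeping a guided stream and an unguided stream side by side.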
