Turning Adversaries into Allies: Reversing Typographic Attacks for Multimodal E-Commerce Product Retrieval

Janet Jenq; Hongda Shen

arXiv:2511.05325·cs.LG·November 10, 2025

TL;DR

This paper improves multimodal e-commerce product retrieval by rendering textual metadata (e.g., titles, descriptions) directly onto product images, repurposing the mechanism behind typographic attacks to raise retrieval accuracy across three datasets and six vision foundation models.

Contribution

The paper proposes a vision-text compression technique that reverses the logic of typographic attacks: relevant text is rendered onto product images to strengthen image-text alignment and improve retrieval performance.

Findings

01. Consistent improvement in retrieval accuracy across datasets and models.

02. Effective enhancement for zero-shot multimodal retrieval.

03. Visually rendering metadata boosts model robustness against attacks.

Abstract

Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. However, vision-language models such as CLIP are vulnerable to typographic attacks, where misleading or irrelevant text embedded in images skews model predictions. In this work, we propose a novel method that reverses the logic of typographic attacks by rendering relevant textual content (e.g., titles, descriptions) directly onto product images to perform vision-text compression, thereby strengthening image-text alignment and boosting multimodal product retrieval performance. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using six state-of-the-art vision foundation models. Our experiments demonstrate consistent improvements in unimodal and multimodal retrieval…
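
The rendering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `render_metadata_on_image`, the banner layout below the image, the default bitmap font, and the parameters `banner_height` and `chars_per_line` are all assumptions made for illustration, and the sketch relies on the Pillow library. It only shows the general idea of drawing product text onto the image before the image is passed to a vision encoder.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont


def render_metadata_on_image(image, text, banner_height=60, chars_per_line=40):
    """Return a new image with `text` drawn in a white banner below `image`.

    Hypothetical helper: the placement, font, and sizes are illustrative
    assumptions, not details taken from the paper.
    """
    font = ImageFont.load_default()
    lines = textwrap.wrap(text, width=chars_per_line)

    # Build a taller canvas so the product pixels are left untouched.
    canvas = Image.new("RGB", (image.width, image.height + banner_height), "white")
    canvas.paste(image.convert("RGB"), (0, 0))

    draw = ImageDraw.Draw(canvas)
    y = image.height + 4
    for line in lines:
        # Lines that fall outside the banner are simply clipped by the canvas.
        draw.text((4, y), line, fill="black", font=font)
        y += 12
    return canvas
```

In a retrieval pipeline, the returned image would replace the original product image as input to the image encoder (for example, a CLIP-style model), so that the embedding reflects both the product's appearance and its rendered text.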

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques