Turning Adversaries into Allies: Reversing Typographic Attacks for Multimodal E-Commerce Product Retrieval
Janet Jenq, Hongda Shen

TL;DR
This paper introduces a method that enhances multimodal e-commerce product retrieval by rendering textual metadata onto images, improving model robustness against typographic attacks and increasing retrieval accuracy across various datasets and models.
Contribution
It proposes a novel vision-text compression technique that reverses typographic attacks, strengthening image-text alignment for better retrieval performance.
Findings
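Consistent improvement in retrieval accuracy across three vertical-specific datasets (sneakers, handbags, and trading cards) and six vision foundation models.
Effective enhancement for zero-shot multimodal retrieval.
Visually rendering metadata onto product images boosts model robustness against typographic attacks.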
Abstract
Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. However, vision-language models such as CLIP are vulnerable to typographic attacks, where misleading or irrelevant text embedded in images skews model predictions. In this work, we propose a novel method that reverses the logic of typographic attacks by rendering relevant textual content (e.g., titles, descriptions) directly onto product images to perform vision-text compression, thereby strengthening image-text alignment and boosting multimodal product retrieval performance. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using six state-of-the-art vision foundation models. Our experiments demonstrate consistent improvements in unimodal and multimodal retrieval…
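The core idea described above, rendering relevant text directly onto a product image, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `render_text_on_image`, the white strip appended below the image, the default font, and the wrapping width are all assumptions chosen for the example, and the abstract does not specify where or how the text is placed. The sketch uses the Pillow library and stops before the embedding step; the resulting image would then be passed to a vision model such as CLIP.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont


def render_text_on_image(image, text, strip_height=60, chars_per_line=40):
    """Return a copy of `image` with `text` drawn on a white strip below it.

    Hypothetical helper: the strip layout and parameters are assumptions,
    not details taken from the paper.
    """
    image = image.convert("RGB")
    width, height = image.size

    # New canvas: original image on top, blank white strip underneath.
    canvas = Image.new("RGB", (width, height + strip_height), "white")
    canvas.paste(image, (0, 0))

    # Wrap the metadata text and draw it inside the strip.
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    wrapped = "\n".join(textwrap.wrap(text, width=chars_per_line))
    draw.multiline_text((4, height + 4), wrapped, fill="black", font=font)
    return canvas


# Stand-in for a product photo; a real pipeline would load one from disk.
product = Image.new("RGB", (224, 224), "gray")
augmented = render_text_on_image(
    product, "Example brand low-top sneaker, white/red, size 10"
)
```

Appending a strip rather than overlaying the text keeps the product pixels untouched, which makes it easy to compare embeddings of the original and augmented images; an overlay would be an equally plausible reading of the abstract.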
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques
