TIGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval
Xinyu Sun, Huangyu Dai, Lingtao Mao, Zexin Zheng, Zihan Liang, Ben Chen, Chenyi Lei, Wenwu Ou

TL;DR
TIGER-FG is a novel text-guided framework for fine-grained e-commerce image retrieval that improves accuracy by producing target-focused representations without explicit detection.
Contribution
It introduces a detection-free, text-guided grounding method with dual distillation objectives and a new benchmark suite for realistic e-commerce retrieval scenarios.
Findings
TIGER-FG improves Recall@1 by 6.1 and 34.4 percentage points on two benchmarks.
It generalizes well to noisy and one-to-many retrieval scenarios.
It achieves high performance with only 85.7M parameters and 256-dimensional embeddings.
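For readers unfamiliar with the metric behind these findings, the following is a minimal sketch of how Recall@K is commonly computed for embedding-based retrieval. It is not the authors' evaluation code: the function names (`l2_normalize`, `recall_at_k`), the use of cosine similarity, and the toy 3-dimensional vectors are all assumptions made for illustration. A query counts as a hit if any of its relevant gallery items appears in the top K, which also covers the one-to-many case where a query has several correct items.

```python
import math
from typing import Sequence, Set


def l2_normalize(vec: Sequence[float]) -> list:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def recall_at_k(
    queries: Sequence[Sequence[float]],
    gallery: Sequence[Sequence[float]],
    relevant: Sequence[Set[int]],
    k: int = 1,
) -> float:
    """Fraction of queries with at least one relevant gallery index in the top k.

    relevant[i] holds the gallery indices that are correct for queries[i].
    Similarity is cosine similarity (an assumption, not taken from the paper).
    """
    if not queries:
        raise ValueError("at least one query is required")
    gallery_unit = [l2_normalize(g) for g in gallery]
    hits = 0
    for query, rel in zip(queries, relevant):
        q_unit = l2_normalize(query)
        # Cosine similarity reduces to a dot product on unit vectors.
        scores = [sum(a * b for a, b in zip(q_unit, g)) for g in gallery_unit]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if any(idx in rel for idx in ranked[:k]):
            hits += 1
    return hits / len(queries)


# Toy data: 2 queries, 3 gallery items, 3-dim embeddings (hypothetical values).
toy_queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_gallery = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.1, 0.8, 0.1]]
toy_relevant = [{0}, {1}]

r1 = recall_at_k(toy_queries, toy_gallery, toy_relevant, k=1)
r3 = recall_at_k(toy_queries, toy_gallery, toy_relevant, k=3)
print(f"Recall@1 = {r1:.2f}, Recall@3 = {r3:.2f}")
```

A gain reported in percentage points is the difference between two such values after multiplying each by 100.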
Abstract
E-commerce image search often takes a cropped image as the query, while each candidate is represented by full item images and structured text. This image-to-multimodal retrieval setting presents two asymmetries: a modality disparity, in which a visual query must match image-text items, and a granularity disparity, in which a cropped query must be compared with full images containing background context and possible distractors. Detection-based pipelines handle the granularity disparity through explicit localization but incur extra cost and error propagation, whereas CLIP-style encoders avoid detection but are vulnerable to backgrounds and irrelevant items. To address these limitations, we propose TIGER-FG, a text-guided implicit fine-grained grounding framework for image-to-multimodal e-commerce retrieval. TIGER-FG uses item text as semantic guidance to produce target-focused item representations…
