TIGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval

Xinyu Sun; Huangyu Dai; Lingtao Mao; Zexin Zheng; Zihan Liang; Ben Chen; Chenyi Lei; Wenwu Ou

arXiv:2605.18434 · cs.IR · May 19, 2026

TL;DR

TIGER-FG is a text-guided framework for fine-grained e-commerce image retrieval. It improves accuracy by producing target-focused item representations without explicit detection.

Contribution

It introduces a detection-free, text-guided grounding method with dual distillation objectives and a new benchmark suite for realistic e-commerce retrieval scenarios.

Findings

01

TIGER-FG improves Recall@1 by 6.1 and 34.4 percentage points on two benchmarks.

02

It generalizes well to noisy and one-to-many retrieval scenarios.

03

It achieves high performance with only 85.7M parameters and 256-dimensional embeddings.
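
As a reading aid for the Recall@1 figures above, here is a minimal sketch of how Recall@K is commonly computed for embedding-based retrieval. It is not the paper's evaluation code: the function names, the toy data, and the use of cosine similarity are assumptions made for illustration, and only the 256-dimensional embedding size is taken from the findings. Each query carries a set of relevant item indices, so the same function covers one-to-many retrieval. Because the result is a fraction in [0, 1], a gain of 6.1 percentage points corresponds to a difference of 0.061.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def recall_at_k(query_embs, item_embs, relevant, k=1):
    """Fraction of queries with at least one relevant item among the top-k.

    relevant[i] is the set of item indices that count as correct for query i
    (a hypothetical label format chosen for this sketch).
    """
    items = [l2_normalize(v) for v in item_embs]
    hits = 0
    for query, rel in zip(query_embs, relevant):
        q = l2_normalize(query)
        sims = [sum(a * b for a, b in zip(q, item)) for item in items]
        # Rank item indices by descending similarity and keep the first k.
        topk = sorted(range(len(items)), key=lambda i: sims[i], reverse=True)[:k]
        hits += any(i in rel for i in topk)
    return hits / len(query_embs)


# Toy data: 5 random items and 2 queries that are noisy copies of items 0 and 2.
random.seed(0)
DIM = 256  # embedding size reported in the findings
items = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
queries = [[x + 0.01 * random.gauss(0, 1) for x in items[i]] for i in (0, 2)]
relevant = [{0}, {2, 3}]  # the second query has two acceptable items

score = recall_at_k(queries, items, relevant, k=1)
print(f"Recall@1 = {score:.2f}")
```

In this toy setup each query is a lightly perturbed copy of its target, so the target ranks first and Recall@1 is 1.0; real benchmarks with background clutter and distractor items are what make the metric informative.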

Abstract

E-commerce image search often takes a cropped image as the query, while each candidate is represented by full item images and structured text. This image-to-multimodal retrieval setting presents two asymmetries: a modality disparity, in which a visual query must match image-text items, and a granularity disparity, in which a cropped query must be compared with full images containing background context and possible distractors. Detection-based pipelines handle the granularity disparity through explicit localization but incur extra cost and error propagation, whereas CLIP-style encoders avoid detection but are vulnerable to backgrounds or irrelevant items. To address these limitations, we propose TIGER-FG, a text-guided implicit fine-grained grounding framework for image-to-multimodal e-commerce retrieval. TIGER-FG uses item text as semantic guidance to produce target-focused item representations…

Code & Models

Datasets

xyxy01/ECom-RF-IMMR
dataset · 295 dl
