Hierarchical Similarity Learning for Language-based Product Image Retrieval
Zhe Ma, Fenghao Liu, Jianfeng Dong, Xiaoye Qu, Yuan He, Shouling Ji

TL;DR
This paper introduces a Hierarchical Similarity Learning network that captures multi-level representations for more accurate language-based product image retrieval, outperforming previous methods by considering multiple granularities.
Contribution
The paper proposes a novel HSL network that models multi-granularity similarities at different levels, enhancing cross-modal retrieval performance.
Findings
Effective on large-scale product dataset
Improves matching accuracy by considering multiple granularities
Code and data are publicly available
Abstract
This paper aims for the language-based product image retrieval task. The majority of previous works have made significant progress by designing network structure, similarity measurement, and loss function. However, they typically perform vision-text matching at certain granularity regardless of the intrinsic multiple granularities of images. In this paper, we focus on the cross-modal similarity measurement, and propose a novel Hierarchical Similarity Learning (HSL) network. HSL first learns multi-level representations of input data by stacked encoders, and object-granularity similarity and image-granularity similarity are computed at each level. All the similarities are combined as the final hierarchical cross-modal similarity. Experiments on a large-scale product retrieval dataset demonstrate the effectiveness of our proposed method. Code and data are available at…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques
