TL;DR
This paper introduces SGMN, a spatiotemporal graph-based multi-modal network that improves livestreaming product retrieval by combining text-guided attention, long-range graph modeling, and hard example mining to handle background distractors, video-image heterogeneity, and visually similar products.
Contribution
The paper presents a new multi-modal network architecture that effectively handles distractors, video-image heterogeneity, and subtle product differences in livestreaming product retrieval.
Findings
SGMN significantly outperforms state-of-the-art methods.
The text-guided attention improves focus on intended products.
Hard example mining enhances fine-grained product discrimination.
Abstract
With the rapid expansion of e-commerce, more consumers have become accustomed to making purchases via livestreaming. Accurately identifying the products being sold by salespeople, i.e., livestreaming product retrieval (LPR), poses a fundamental and daunting challenge. The LPR task encompasses three primary dilemmas in real-world scenarios: 1) recognizing the intended products among distractor products present in the background; 2) video-image heterogeneity, in that the appearance of products showcased in live streams often deviates substantially from standardized product images in stores; 3) the presence of numerous confusing products with subtle visual nuances in the shop. To tackle these challenges, we propose the Spatiotemporal Graphing Multi-modal Network (SGMN). First, we employ a text-guided attention mechanism that leverages the spoken content of salespeople to guide the model to…
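The abstract is cut off before it details the text-guided attention mechanism, so the following is only a minimal sketch of the general idea, not SGMN's actual implementation. It assumes the salesperson's speech has already been encoded into a single vector (`text_vec`) and that each frame has been split into region features (`region_feats`). Both names, the single-query setup, and the scaled dot-product scoring are illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def text_guided_attention(text_vec, region_feats):
    """Weight visual regions by their similarity to a text embedding.

    Hypothetical helper: text_vec is the query, each region feature acts
    as both key and value (scaled dot-product attention with one query).
    """
    d = len(text_vec)
    # Similarity between the spoken-content embedding and every region.
    scores = [
        sum(t * r for t, r in zip(text_vec, region)) / math.sqrt(d)
        for region in region_feats
    ]
    weights = softmax(scores)
    # Attention-pooled visual feature: regions matching the text dominate.
    pooled = [
        sum(w * region[i] for w, region in zip(weights, region_feats))
        for i in range(d)
    ]
    return weights, pooled

# Toy data: region 0 aligns with the text, regions 1 and 2 are distractors.
text_vec = [1.0, 0.0, 0.0]
region_feats = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.0, 0.9],
]
weights, pooled = text_guided_attention(text_vec, region_feats)
```

In this toy setup the region most aligned with the text embedding receives the largest weight, which is the intuition behind using spoken content to suppress background distractor products.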
Taxonomy
Methods: Softmax · Attention Is All You Need · Focus
