Synthetic Data Powers Product Retrieval for Long-tail Knowledge-Intensive Queries in E-commerce Search
Gui Ling, Weiyuan Li, Yue Jiang, Wenjun Peng, Xingxian Liu, Dongshuai Li, Fuyu Lv, Dan Ou, and Haihong Tang

TL;DR
This paper introduces a synthetic data generation framework leveraging large language models to improve product retrieval for long-tail, knowledge-intensive e-commerce queries, resulting in better retrieval accuracy and user experience.
Contribution
It presents a novel data synthesis approach that distills query rewriting capabilities of LLMs into an efficient retrieval system for challenging long-tail queries.
Findings
Synthetic data improves retrieval performance without additional tricks.
Human evaluations show significant improvements in user experience.
The approach effectively addresses long-tail, knowledge-intensive query challenges.
Abstract
Product retrieval is the backbone of e-commerce search: for each user query, it identifies a high-recall candidate set from billions of items, laying the foundation for high-quality ranking and user experience. Despite extensive optimization for mainstream queries, existing systems still struggle with long-tail queries, especially knowledge-intensive ones. These queries exhibit diverse linguistic patterns, often lack explicit purchase intent, and require domain-specific knowledge reasoning for accurate interpretation. They also suffer from a shortage of reliable behavioral logs, which makes such queries a persistent challenge for retrieval optimization. To address these issues, we propose an efficient data synthesis framework tailored to retrieval involving long-tail, knowledge-intensive queries. The key idea is to implicitly distill the capabilities of a powerful offline…
