Doc2Token: Bridging Vocabulary Gap by Predicting Missing Tokens for E-commerce Search

Kaihao Li; Juexin Lin; Tony Lee

arXiv:2406.19647 · cs.IR · July 1, 2024

TL;DR

Doc2Token is a document-expansion method for e-commerce search that predicts relevant tokens missing from a product page and adds them to the document for retrieval. It is reported to be more efficient than Doc2Query and to have produced significant revenue gains in real-world deployment.

Contribution

The paper introduces a token-prediction approach to document expansion that outperforms Doc2Query in diversity and efficiency and has been deployed on Walmart.com.

Findings

01 Outperforms Doc2Query on novel ROUGE score and diversity

02 Reduces training and inference times

03 Achieved a significant revenue increase in online A/B testing

Abstract

Addressing the "vocabulary mismatch" issue in information retrieval is a central challenge for e-commerce search engines, because product pages often miss important keywords that customers search for. Doc2Query[1] is a popular document-expansion technique that predicts search queries for a document and includes the predicted queries with the document for retrieval. However, this approach can be inefficient for e-commerce search, because the predicted query tokens are often already present in the document. In this paper, we propose Doc2Token, a technique that predicts relevant tokens (instead of queries) that are missing from the document and includes these tokens in the document for retrieval. For the task of predicting missing tokens, we introduce a new metric, "novel ROUGE score". Doc2Token is demonstrated to be superior to Doc2Query in terms of novel ROUGE score and diversity of…
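
The core idea in the abstract, keeping only query tokens that the document does not already contain and adding them to the document, can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer, the function names `tokenize`, `missing_tokens`, and `expand_document`, and the sample product and queries are all assumptions made for illustration, and the paper's model predicts such tokens rather than reading them from known queries.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def missing_tokens(document, queries):
    """Return query tokens that are absent from the document, in first-seen order."""
    doc_vocab = set(tokenize(document))
    seen = set()
    novel = []
    for query in queries:
        for tok in tokenize(query):
            # Skip tokens the document already has, and avoid duplicates.
            if tok not in doc_vocab and tok not in seen:
                seen.add(tok)
                novel.append(tok)
    return novel


def expand_document(document, tokens):
    """Append the given tokens to the document text before indexing."""
    return document + " " + " ".join(tokens) if tokens else document


# Hypothetical product title and customer queries.
doc = "Acme stainless steel water bottle 32 oz"
queries = ["insulated water flask", "hiking steel bottle"]

novel = missing_tokens(doc, queries)
expanded = expand_document(doc, novel)
print(novel)
print(expanded)
```

In this sketch, tokens such as "water" and "steel" are dropped because the document already contains them, which is the redundancy the abstract attributes to query-level expansion.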

Taxonomy

Topics: Web Data Mining and Analysis · Natural Language Processing Techniques · Sentiment Analysis and Opinion Mining