A Data-Centric Approach to Multilingual E-Commerce Product Search: Case Study on Query-Category and Query-Item Relevance
Yabo Yin, Yang Xi, Jialong Wang, Shanqi Wang, Jiateng Hu

TL;DR
This paper introduces a data-centric framework that improves multilingual e-commerce search relevance by enhancing training data quality and diversity, outperforming model-centric approaches on key tasks.
Contribution
The work presents a novel, architecture-agnostic data augmentation and filtering strategy to address data imbalance and noise in multilingual search models.
Findings
Significant F1 score improvements on the CIKM AnalytiCup 2025 dataset.
Data engineering can outperform complex model modifications.
Effective handling of low-resource languages and noisy labels.
Abstract
Multilingual e-commerce search suffers from severe data imbalance across languages, label noise, and limited supervision for low-resource languages. These challenges impede the cross-lingual generalization of relevance models despite the strong capabilities of large language models (LLMs). In this work, we present a practical, architecture-agnostic, data-centric framework to enhance performance on two core tasks: Query-Category (QC) relevance (matching queries to product categories) and Query-Item (QI) relevance (matching queries to product titles). Rather than altering the model, we redesign the training data through three complementary strategies: (1) translation-based augmentation to synthesize examples for languages absent in training, (2) semantic negative sampling to generate hard negatives and mitigate class imbalance, and (3) self-validation filtering to detect and remove likely…
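The abstract does not say how strategies (2) and (3) are implemented, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes hard negatives can be approximated by picking the non-matching candidates most similar to the positive one (token-level Jaccard overlap stands in for a real semantic similarity model), and that self-validation filtering means dropping examples whose label a trained relevance model confidently contradicts. The names `Example`, `hard_negatives`, `self_validation_filter`, and the `threshold` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    query: str
    candidate: str  # a category path (QC) or a product title (QI)
    label: int      # 1 = relevant, 0 = not relevant


def token_jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for an embedding-based score."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def hard_negatives(query: str, positive: str, pool: list[str], k: int = 1) -> list[Example]:
    """Label the k candidates most similar to the positive as negatives (assumed scheme)."""
    others = [c for c in pool if c != positive]
    ranked = sorted(others, key=lambda c: token_jaccard(positive, c), reverse=True)
    return [Example(query, c, 0) for c in ranked[:k]]


def self_validation_filter(
    examples: list[Example], scores: list[float], threshold: float = 0.9
) -> list[Example]:
    """Keep examples unless the model's P(relevant) confidently contradicts the label.

    `scores[i]` is assumed to be the predicted relevance probability for
    `examples[i]`, produced by a model trained on the same (noisy) data.
    """
    kept = []
    for ex, p in zip(examples, scores):
        contradicts = (ex.label == 1 and p <= 1 - threshold) or (
            ex.label == 0 and p >= threshold
        )
        if not contradicts:
            kept.append(ex)
    return kept


if __name__ == "__main__":
    pool = ["red running shoes", "blue running shoes", "kitchen blender"]
    print(hard_negatives("running shoes red", "red running shoes", pool, k=1))

    data = [Example("q", "a", 1), Example("q", "b", 0), Example("q", "c", 1)]
    print(self_validation_filter(data, [0.95, 0.97, 0.03]))
```

In this sketch the lexically closest non-matching title is chosen as the hard negative, and the filter discards the two examples whose labels disagree with a confident prediction. A real pipeline would likely swap in multilingual embeddings for the similarity function and tune the threshold per language.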
