Unbiased Sentence Encoder For Large-Scale Multi-lingual Search Engines
Mahdi Hajiaghayi, Monir Hajiaghayi, Mark Bolin

TL;DR
This paper introduces a universal multi-lingual sentence encoder designed for search engines, trained on diverse datasets including user search data and NLI datasets, to improve semantic similarity scoring across various query lengths and languages.
Contribution
The paper proposes a multi-task training approach that combines heterogeneous datasets to develop a robust, unbiased multi-lingual sentence encoder for large-scale search applications.
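One way to picture combining heterogeneous datasets is to choose which dataset each training batch comes from, using weights that stop the largest source from dominating. The sketch below is illustrative only: the summary does not say how the authors mix their data, so the capping rule, the function names `mixing_weights` and `sample_task`, and the dataset sizes are all assumptions.

```python
import random

def mixing_weights(sizes, cap):
    """Turn dataset sizes into sampling probabilities.

    Each size is clipped at `cap` so that a very large source
    (for example click logs) cannot crowd out the smaller ones.
    The capping rule is an assumption, not the paper's method.
    """
    capped = {name: min(n, cap) for name, n in sizes.items()}
    total = sum(capped.values())
    return {name: n / total for name, n in capped.items()}

def sample_task(weights, rng):
    """Pick the dataset that supplies the next training batch."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Hypothetical dataset sizes, chosen only to illustrate the skew.
sizes = {"search_clicks": 1_000_000, "nli": 50_000, "translation": 50_000}
weights = mixing_weights(sizes, cap=100_000)

rng = random.Random(0)
schedule = [sample_task(weights, rng) for _ in range(5)]
```

With these made-up sizes, the click data would make up about 91% of batches under purely size-proportional sampling; the cap brings it down to half, leaving the rest for the NLI and translation tasks.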
Findings
Effective multi-task training leverages diverse datasets.
The encoder improves semantic similarity scoring across languages.
The approach reduces bias from click data and handles various query lengths.
Abstract
In this paper, we present a multi-lingual sentence encoder that can be used in search engines as a query and document encoder. This embedding enables a semantic similarity score between queries and documents that can be an important feature in document ranking and relevancy. To train such a customized sentence encoder, it is beneficial to leverage users' search data in the form of query-document clicked pairs; however, we must avoid relying too much on search click data, as it is biased and does not cover many unseen cases. The search data is heavily skewed towards short queries, and the data for long queries is small and often noisy. The goal is to design a universal multi-lingual encoder that works for all cases and covers both short and long queries. We select a number of public NLI datasets in different languages, along with translation data, and together with user search data we train a language model…
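To make the idea of a semantic similarity score concrete, here is a minimal sketch of ranking documents against a query by the cosine similarity of their embeddings. It assumes the encoder has already mapped the query and each document to a fixed-length vector. Cosine similarity is a common choice for comparing embeddings, but the abstract does not name the scoring function, so treat it as an assumption. The vectors, document ids, and function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero embedding as unrelated
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 2-dimensional embeddings standing in for real encoder output.
query_vec = [1.0, 0.0]
doc_vecs = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [1.0, 1.0],
}
ranking = rank_documents(query_vec, doc_vecs)
```

In a real ranker a score like this would typically be one feature among several, which matches the abstract's description of it as "an important feature in document ranking and relevancy".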
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
