Unsupervised Multilingual Dense Retrieval via Generative Pseudo Labeling

Chao-Wei Huang; Chen-An Li; Tsu-Yuan Hsu; Chen-Yu Hsu; Yun-Nung Chen

arXiv:2403.03516 · cs.CL · March 7, 2024 · 1 citation

TL;DR

This paper presents UMR, an unsupervised method that trains multilingual dense retrieval models without any paired data. It obtains pseudo labels from multilingual language models, refines the retriever through a two-stage iterative framework, and outperforms supervised baselines on two benchmark datasets.

Contribution

Introduces UMR, an unsupervised framework for multilingual dense retrieval that uses the sequence likelihood estimates of multilingual language models as pseudo labels, so no paired query-document data is required.

Findings

01 UMR outperforms supervised baselines on two benchmark datasets.

02 The two-stage iterative framework improves retrieval performance.

03 The approach removes the need for costly paired training data.

Abstract

Dense retrieval methods have demonstrated promising performance in multilingual information retrieval, where queries and documents can be in different languages. However, dense retrievers typically require a substantial amount of paired data, which poses even greater challenges in multilingual scenarios. This paper introduces UMR, an Unsupervised Multilingual dense Retriever trained without any paired data. Our approach leverages the sequence likelihood estimation capabilities of multilingual language models to acquire pseudo labels for training dense retrievers. We propose a two-stage framework which iteratively improves the performance of multilingual dense retrievers. Experimental results on two benchmark datasets show that UMR outperforms supervised baselines, showcasing the potential of training multilingual retrievers without paired data, thereby enhancing their practicality. Our…
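The pseudo-labeling idea in the abstract can be illustrated with a minimal sketch. It assumes that a multilingual language model has already produced a sequence log-likelihood for each candidate passage (here, hypothetically, the log-probability of the query given the passage), and that the retriever is trained to match the resulting soft distribution with a KL-divergence loss. The function names, the temperature parameter, the choice of loss, and all numbers are illustrative assumptions and are not taken from the paper or its repository.

```python
import math


def softmax(scores, temperature=1.0):
    """Normalize raw scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_labels(lm_log_likelihoods, temperature=1.0):
    """Turn per-passage LM log-likelihoods into soft pseudo labels.

    Assumption: a higher sequence likelihood means a more relevant passage.
    """
    return softmax(lm_log_likelihoods, temperature)


def kl_divergence(target, predicted):
    """KL(target || predicted) for two distributions over the same passages."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


# Hypothetical values for three candidate passages of one query.
lm_scores = [-12.0, -15.5, -20.0]   # assumed log-likelihoods from a multilingual LM
retriever_scores = [1.2, 1.0, 0.9]  # assumed query-passage dot products

targets = pseudo_labels(lm_scores)
predicted = softmax(retriever_scores)
loss = kl_divergence(targets, predicted)  # the quantity a training step would minimize
```

In an iterative setup such as the one the abstract describes, a retriever trained this way would supply the candidate passages for the next round of pseudo labeling; that loop is omitted here because the abstract does not spell out its details.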

Code & Models

Repositories

miulab/umr
PyTorch · Official

Taxonomy

Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Text and Document Classification Technologies