TL;DR
This paper introduces a fully unsupervised method for cross-lingual information retrieval that uses monolingual data to create shared embeddings, enabling effective retrieval without bilingual resources.
Contribution
It presents a framework that induces cross-lingual embeddings solely from monolingual corpora using adversarial neural networks. CLIR models built on these embeddings outperform baselines whose embeddings are induced from word-level and document-level alignments.
Findings
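Outperforms baselines whose cross-lingual embeddings are induced from bilingual word-level and document-level alignments.
Effective across three language pairs of varying similarity (English-Dutch, English-Italian, English-Finnish) on the CLEF CLIR collections.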
Unsupervised ensemble models further improve retrieval performance.
Abstract
We propose a fully unsupervised framework for ad-hoc cross-lingual information retrieval (CLIR) which requires no bilingual data at all. The framework leverages shared cross-lingual word embedding spaces in which terms, queries, and documents can be represented, irrespective of their actual language. The shared embedding spaces are induced solely on the basis of monolingual corpora in two languages through an iterative process based on adversarial neural networks. Our experiments on the standard CLEF CLIR collections for three language pairs of varying degrees of language similarity (English-Dutch/Italian/Finnish) demonstrate the usefulness of the proposed fully unsupervised approach. Our CLIR models with unsupervised cross-lingual embeddings outperform baselines that utilize cross-lingual embeddings induced relying on word-level and document-level alignments. We then demonstrate that…
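To make the idea of a shared embedding space concrete, here is a minimal sketch of how retrieval could work once such a space exists. It is an illustration under stated assumptions, not the authors' implementation: the toy 3-dimensional vectors, the names `SHARED_SPACE`, `embed_text`, and `rank_documents`, and the choice to represent a query or document as the plain average of its word vectors are all hypothetical. The adversarial induction of the space itself is not shown.

```python
import math

# Hypothetical toy shared space: English and Dutch words already mapped
# into the same 3-d vector space (values invented for illustration).
SHARED_SPACE = {
    "dog": [0.90, 0.10, 0.00],
    "hond": [0.88, 0.12, 0.02],
    "house": [0.10, 0.90, 0.00],
    "huis": [0.12, 0.85, 0.05],
    "music": [0.00, 0.10, 0.90],
    "muziek": [0.02, 0.08, 0.92],
}


def embed_text(tokens, space):
    """Average the vectors of known tokens (an assumed aggregation choice)."""
    vectors = [space[t] for t in tokens if t in space]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(query_tokens, documents, space):
    """Rank documents by cosine similarity to the query in the shared space."""
    query_vec = embed_text(query_tokens, space)
    scored = []
    for doc_id, tokens in documents.items():
        doc_vec = embed_text(tokens, space)
        if query_vec is None or doc_vec is None:
            score = 0.0  # nothing to compare when no token is in the space
        else:
            score = cosine(query_vec, doc_vec)
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# English query against Dutch documents, with no translation step.
DOCS = {"doc_nl_1": ["hond", "huis"], "doc_nl_2": ["muziek"]}
ranking = rank_documents(["music"], DOCS, SHARED_SPACE)
```

Because queries and documents live in the same space, the English query "music" is compared directly with Dutch documents. In this toy setup the document containing "muziek" ranks first.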
