Towards Better Monolingual Japanese Retrievers with Multi-Vector Models
Benjamin Clavié

TL;DR
This paper introduces JaColBERT, a compact multi-vector retriever trained on limited Japanese data, outperforming existing monolingual and multilingual models in Japanese document retrieval tasks.
Contribution
The paper presents a Japanese-specific multi-vector retrieval model that is competitive with multilingual models while being smaller and trained on roughly two orders of magnitude less data.
Findings
JaColBERT outperforms all existing monolingual Japanese retrievers.
It also surpasses multilingual models on out-of-domain tasks.
The model uses only 110 million parameters and limited Japanese data.
Abstract
As language-specific training data tends to be sparsely available compared to English, document retrieval in many languages has largely relied on multilingual models. In Japanese, the best-performing deep-learning-based retrieval approaches rely on multilingual dense embedders, with Japanese-only models lagging far behind. However, multilingual models require considerably more compute and data to train and have higher computational and memory requirements, while often missing out on culturally relevant information. In this paper, we introduce JaColBERT, a family of multi-vector retrievers trained on two orders of magnitude less data than their multilingual counterparts while reaching competitive performance. Our strongest model largely outperforms all existing monolingual Japanese retrievers on all datasets, as well as the strongest existing multilingual models on all out-of-domain tasks,…
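The abstract contrasts multi-vector retrievers with single-vector dense embedders but does not spell out how scoring works. The sketch below illustrates the ColBERT-style "MaxSim" late interaction that multi-vector models commonly use: each query token vector is matched to its most similar document token vector, and the per-token maxima are summed. It is a minimal illustration under stated assumptions, not the paper's implementation; the function names and the toy 2-dimensional embeddings are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best cosine similarity to any document token."""
    q = [l2_normalize(v) for v in query_vecs]
    d = [l2_normalize(v) for v in doc_vecs]
    total = 0.0
    for qv in q:
        # Late interaction: each query token picks its single best-matching document token.
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total


# Toy token embeddings (hypothetical values, one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]  # has a close match for every query token
doc_b = [[1.0, 1.0]]                          # only a partial match for each query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Here `doc_a` scores higher than `doc_b` because every query token finds an exact directional match among its token vectors, whereas a single pooled vector such as `doc_b` can only partially match each query token.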
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Information Retrieval and Search Behavior
