Towards Better Monolingual Japanese Retrievers with Multi-Vector Models

Benjamin Clavié

arXiv:2312.16144 · cs.CL · September 24, 2024 · 2 cites

TL;DR

This paper introduces JaColBERT, a compact multi-vector retriever trained on limited Japanese data. It outperforms existing monolingual Japanese retrievers and, on out-of-domain tasks, the strongest existing multilingual models.

Contribution

The paper presents a novel Japanese-specific multi-vector retrieval model that achieves competitive performance with significantly less data and smaller size than multilingual models.

Findings

01. JaColBERT outperforms all existing monolingual Japanese retrievers.

02. It also surpasses multilingual models on out-of-domain tasks.

03. The model uses only 110 million parameters and limited Japanese data.

Abstract

As language-specific training data tends to be sparsely available compared to English, document retrieval in many languages has largely relied on multilingual models. In Japanese, the best-performing deep-learning-based retrieval approaches rely on multilingual dense embedders, with Japanese-only models lagging far behind. However, multilingual models require considerably more compute and data to train and have higher computational and memory requirements, while often missing out on culturally relevant information. In this paper, we introduce JaColBERT, a family of multi-vector retrievers trained on two orders of magnitude less data than their multilingual counterparts while reaching competitive performance. Our strongest model largely outperforms all existing monolingual Japanese retrievers on all datasets, as well as the strongest existing multilingual models on all out-of-domain tasks,…
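
For readers unfamiliar with multi-vector retrieval, here is a minimal Python sketch of ColBERT-style late-interaction (MaxSim) scoring, the general technique the JaColBERT name points to. It is not the paper's implementation. The function names (`maxsim_score`, `rank`), the toy two-dimensional vectors, and the use of plain dot products are assumptions made for illustration; a real model would produce one embedding per token with a trained encoder.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: for each query token vector, take its best
    match among the document token vectors, then sum those maxima."""
    if not doc_vecs:
        return 0.0
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank(
    query_vecs: Sequence[Vector], docs: Dict[str, Sequence[Vector]]
) -> List[Tuple[str, float]]:
    """Score every document and return (doc_id, score) pairs, best first."""
    scored = [(doc_id, maxsim_score(query_vecs, vecs)) for doc_id, vecs in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): two query tokens, two documents with two tokens each.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "a": [[1.0, 0.0], [0.0, 1.0]],  # covers both query tokens
    "b": [[1.0, 0.0], [1.0, 0.0]],  # covers only the first query token
}
ranking = rank(query, docs)
```

In this toy case document "a" ranks above "b" because each query token finds a close match in it. Unlike a single-vector dense embedder, which compresses a whole document into one vector, this scheme keeps one vector per token and compares them at query time.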

Code & Models

Models

Datasets

mteb/VoyageMMarcoReranking · dataset · 996 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Information Retrieval and Search Behavior