JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources
Benjamin Clavié

TL;DR
This paper introduces JaColBERTv2.5, a highly efficient Japanese multi-vector retrieval model that outperforms existing methods by optimizing training and inference techniques, with minimal resource requirements.
Contribution
It presents a novel training recipe and checkpoint merging method for multi-vector models, significantly improving Japanese retrieval performance with fewer resources.
Findings
JaColBERTv2.5 achieves state-of-the-art results on Japanese retrieval benchmarks.
The model has only 110 million parameters and is trained in under 15 hours on 4 GPUs.
Performance improvements are validated through comprehensive evaluations.
Abstract
Neural Information Retrieval has advanced rapidly in high-resource languages, but progress in lower-resource ones such as Japanese has been hindered by data scarcity, among other challenges. Consequently, multilingual models have dominated Japanese retrieval, despite their computational inefficiencies and inability to capture linguistic nuances. While recent multi-vector monolingual models like JaColBERT have narrowed this gap, they still lag behind multilingual methods in large-scale evaluations. This work addresses the suboptimal training methods of multi-vector retrievers in lower-resource settings, focusing on Japanese. We systematically evaluate and improve key aspects of the inference and training settings of JaColBERT, and more broadly, multi-vector models. We further enhance performance through a novel checkpoint merging step, showcasing it to be an effective way of combining…
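For readers unfamiliar with the term, multi-vector retrievers in the ColBERT family keep one embedding per token and usually score a query against a document with late interaction ("MaxSim"): each query token is matched to its most similar document token, and those maxima are summed. The following is a minimal sketch of that scoring rule only, assuming cosine similarity and toy two-dimensional vectors; the function names `normalize` and `maxsim_score` and all of the data are hypothetical illustrations, not code from JaColBERTv2.5.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score (illustrative): for each query token vector,
    take its best cosine similarity against any document token vector,
    then sum those maxima over the query tokens."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


# Toy "token embeddings", invented purely for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]  # has a match for both query tokens
doc_b = [[1.0, 0.0], [-1.0, 0.0]]             # matches only the first query token

docs = {"doc_a": doc_a, "doc_b": doc_b}
ranked = sorted(docs, key=lambda name: maxsim_score(query, docs[name]), reverse=True)
print(ranked)
```

In this toy setup `doc_a` ranks above `doc_b` because it contains a close match for every query token. A real system would obtain the token vectors from the trained encoder and use an index rather than scoring every document exhaustively.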
Code & Models
- 🤗 answerdotai/JaColBERTv2.4 (model) · 6 dl · ♡ 4
- 🤗 answerdotai/JaColBERTv2.5 (model) · 2.7k dl · ♡ 22
- 🤗 answerdotai/answerai-colbert-small-v1 (model) · 1.4M dl · ♡ 160
- 🤗 brianronan/answerai-colbert-small-v1 (model)
- 🤗 sigridjineth/colbert-small-korean-20241212 (model) · 6 dl · ♡ 2
- 🤗 Derify/ModChemBERT-MLM (model) · 5 dl
- 🤗 Derify/ModChemBERT-MLM-DAPT (model) · 8 dl
- 🤗 Derify/ModChemBERT-MLM-TAFT (model) · 5 dl
- 🤗 Derify/ModChemBERT-MLM-DAPT-TAFT (model) · 4 dl
- 🤗 Derify/ModChemBERT (model) · 9 dl
Taxonomy
Topics: Semantic Web and Ontologies · Information Retrieval and Search Behavior
