ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models

Antoine Chaffin; Luca Arnaboldi; Amélie Chatelain; Florent Krzakala

arXiv:2602.16609 · cs.CL · February 19, 2026

TL;DR

This paper demonstrates that large-scale pre-training of multi-vector models such as ColBERT significantly improves their performance: a fully pre-trained model trained only on public data outperforms models that rely on a small knowledge distillation step on top of single-vector models trained on stronger, closed data.

Contribution

The paper shows the effectiveness of large-scale multi-vector pre-training for ColBERT models and explores training strategies that approach this performance while skipping the most costly unsupervised phase.

Findings

01

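Fully pre-trained ColBERT-Zero, trained only on public data, outperforms GTE-ModernColBERT and its base model, GTE-ModernBERT, setting a new state of the art for models of this size.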

02

Adding a supervised step before the KD step brings performance much closer to full pre-training while skipping the most costly unsupervised phase.

03

Aligning fine-tuning and pre-training setups is crucial for optimal results.

Abstract

Current state-of-the-art multi-vector models are obtained through a small Knowledge Distillation (KD) training step on top of strong single-vector models, leveraging the large-scale pre-training of these models. In this paper, we study the pre-training of multi-vector models and show that large-scale multi-vector pre-training yields much stronger multi-vector models. Notably, a fully ColBERT-pre-trained model, ColBERT-Zero, trained only on public data, outperforms GTE-ModernColBERT as well as its base model, GTE-ModernBERT, which leverages closed and much stronger data, setting a new state of the art for models of this size. We also find that, although performing only a small KD step is not enough to achieve results close to full pre-training, adding a supervised step beforehand makes it possible to achieve much closer performance while skipping the most costly unsupervised phase. Finally, we find that…
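For readers unfamiliar with what "multi-vector" means here, the following is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring, in which a query and a document are each represented by one vector per token rather than a single pooled vector. It is an illustration under stated assumptions, not the paper's implementation: the function names (`normalize`, `maxsim_score`, `rank`) are hypothetical, and the two-dimensional toy vectors stand in for real token embeddings.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token vector, take its best
    cosine match among the document token vectors, then sum over query tokens."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    score = 0.0
    for qv in q:
        score += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return score


def rank(query_vecs, docs):
    """Return document ids sorted by decreasing MaxSim score."""
    scores = {doc_id: maxsim_score(query_vecs, vecs) for doc_id, vecs in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (assumed for illustration): a two-token query and two documents.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
    "doc_b": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first query token
}

if __name__ == "__main__":
    for doc_id in rank(query, docs):
        print(doc_id, maxsim_score(query, docs[doc_id]))
```

A single-vector model would instead pool each text into one embedding and compare the two with a single dot product; the per-token matching above is the part that the multi-vector pre-training discussed in the abstract is meant to train directly.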

Taxonomy

Topics: Machine Learning in Healthcare · Topic Modeling · Explainable Artificial Intelligence (XAI)