ColBERT's [MASK]-based Query Augmentation: Effects of Quadrupling the Query Input Length
Ben Giacalone, Richard Zanibbi

TL;DR
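This paper investigates how the number of [MASK] tokens in ColBERT's queries affects retrieval performance. A few [MASK] tokens initially lower performance, padding queries to an average length of 32 tokens gives a large improvement, and performance then plateaus: extending queries to four times the input length used in training does not cause a collapse.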
Contribution
The paper shows that the term-weighting behavior previously reported for [MASK] tokens in ColBERTv1 also holds for ColBERTv2, and that padding queries with [MASK] tokens to as much as four times the query input length used in training does not degrade retrieval effectiveness.
Findings
Adding only a few [MASK] tokens initially decreases retrieval performance; padding queries to an average length of 32 tokens yields a large increase.
Performance plateaus when query length reaches an average of 32 tokens.
Extending queries to 128 tokens does not significantly harm performance.
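The query lengths in these findings can be made concrete with a minimal sketch of [MASK]-based padding. It assumes the usual ColBERT query framing of `[CLS] [Q] ... [SEP]` followed by [MASK] padding; the function name `augment_query` and the whitespace-split example query are hypothetical and are not taken from the paper or from the ColBERT codebase.

```python
def augment_query(tokens, target_len, mask_token="[MASK]"):
    """Pad a tokenized query with [MASK] tokens up to target_len.

    Assumption: queries are framed as [CLS] [Q] <terms> [SEP], and the
    remaining positions are filled with [MASK] tokens.
    """
    framed = ["[CLS]", "[Q]"] + list(tokens) + ["[SEP]"]
    if len(framed) > target_len:
        # Truncate long queries but keep the closing [SEP].
        framed = framed[: target_len - 1] + ["[SEP]"]
    return framed + [mask_token] * (target_len - len(framed))


# Hypothetical query, already split into tokens for illustration.
query = ["what", "is", "late", "interaction"]

for target_len in (32, 128):
    padded = augment_query(query, target_len)
    print(target_len, padded.count("[MASK]"))
```

With this four-term query, the framed input occupies 7 positions, so a length of 32 leaves 25 [MASK] slots and a length of 128 leaves 121.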
Abstract
A unique aspect of ColBERT is its use of [MASK] tokens in queries to score documents (query augmentation). Prior work shows [MASK] tokens weighting non-[MASK] query terms, emphasizing certain tokens over others, rather than introducing whole new terms as initially proposed. We begin by demonstrating that a term weighting behavior previously reported for [MASK] tokens in ColBERTv1 holds for ColBERTv2. We then examine the effect of changing the number of [MASK] tokens from zero to up to four times past the query input length used in training, both for first-stage retrieval and for scoring candidates, observing an initial decrease in performance with few [MASK]s, a large increase when enough [MASK]s are added to pad queries to an average length of 32, then a plateau in performance afterwards. Additionally, we compare baseline performance to performance when the query length is extended…
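To illustrate how [MASK] tokens can weight existing query terms, the following is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring, in which each query token embedding contributes its maximum cosine similarity over the document token embeddings. The two-dimensional vectors are toy values chosen for illustration, not model outputs, and the helper names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_embs, doc_embs):
    """Sum, over query token embeddings, of the best match in the document."""
    q = [normalize(v) for v in query_embs]
    d = [normalize(v) for v in doc_embs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


# Toy embeddings (assumed for illustration only).
term_a = [1.0, 0.0]      # a query term the document matches
term_b = [0.0, 1.0]      # a query term the document does not match
mask_near_a = [1.0, 0.0]  # a [MASK] whose embedding lands near term_a
doc = [[1.0, 0.0]]

print(maxsim_score([term_a, term_b], doc))
print(maxsim_score([term_a, term_b, mask_near_a], doc))
```

In this toy setting, the [MASK] embedding that falls near `term_a` adds a second contribution for the same match, which is one way to picture the term-weighting behavior described in the abstract.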
Taxonomy
Topics: Cloud Computing and Resource Management
