Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling

Sebastian Hofstätter; Sheng-Chieh Lin; Jheng-Hong Yang; Jimmy Lin; Allan Hanbury

arXiv:2104.06967·cs.IR·May 27, 2021

TL;DR

This paper introduces TAS-Balanced, a resource-efficient training method for dense retrieval models that achieves state-of-the-art low-latency results using only a single GPU, significantly reducing training costs while improving retrieval performance.

Contribution

The paper presents a novel topic-aware sampling technique and dual-teacher supervision for training dense retrievers efficiently on limited hardware, outperforming existing methods.
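The summary above does not spell out how topic-aware sampling composes a batch. As a minimal sketch, assuming that each training batch draws its queries from one precomputed query cluster, the idea might look like the following. The cluster assignments, the function names `group_by_cluster` and `sample_topic_batch`, and the toy query IDs are all hypothetical, and the sketch leaves out the dual-teacher supervision entirely.

```python
import random
from collections import defaultdict
from typing import Dict, List


def group_by_cluster(cluster_of_query: Dict[str, int]) -> Dict[int, List[str]]:
    """Invert a query -> cluster mapping into cluster -> list of query IDs."""
    clusters: Dict[int, List[str]] = defaultdict(list)
    for query_id, cluster_id in cluster_of_query.items():
        clusters[cluster_id].append(query_id)
    return dict(clusters)


def sample_topic_batch(
    clusters: Dict[int, List[str]], batch_size: int, rng: random.Random
) -> List[str]:
    """Pick one cluster at random, then sample the whole batch from it.

    Assumption: "topic-aware" means all queries in a batch share a cluster.
    """
    # Only clusters with enough queries can fill a batch without repeats.
    eligible = sorted(cid for cid, qs in clusters.items() if len(qs) >= batch_size)
    if not eligible:
        raise ValueError("no cluster holds at least batch_size queries")
    cluster_id = rng.choice(eligible)
    return rng.sample(clusters[cluster_id], batch_size)


# Hypothetical cluster assignments, computed once before training.
assignments = {"q1": 0, "q2": 0, "q3": 0, "q4": 1, "q5": 1, "q6": 1}
clusters = group_by_cluster(assignments)
batch = sample_topic_batch(clusters, batch_size=2, rng=random.Random(0))
```

Because the grouping happens once, the per-batch cost is a single random choice plus a sample, which fits the page's emphasis on cheap training.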

Findings

01 Achieves state-of-the-art low-latency retrieval results

02 Outperforms BM25 and previous dense models on TREC-DL datasets

03 Operates effectively on a single consumer-grade GPU

Abstract

A vital step towards the widespread adoption of neural retrieval models is their resource efficiency throughout the training, indexing and query workflows. The neural IR community has recently made great advances in training effective dual-encoder dense retrieval (DR) models. A dense text retrieval model uses a single vector representation per query and passage to score a match, which enables low-latency first-stage retrieval with a nearest neighbor search. Training approaches increasingly require enormous compute power, as they either sample negative passages from a continuously refreshed index or require very large batch sizes for in-batch negative sampling. Instead of relying on more compute capability, we introduce an efficient topic-aware query and balanced margin sampling technique, called TAS-Balanced. We cluster queries once before training and…
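The scoring scheme described in the abstract, one vector per query and per passage with retrieval by nearest neighbor search, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the dot product is assumed as the match score, the 3-dimensional vectors are hypothetical stand-ins for encoder outputs, and the exhaustive scan stands in for the approximate nearest neighbor index a real system would use.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors (the assumed match score)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def retrieve(
    query_vec: Vector, passage_vecs: List[Vector], k: int
) -> List[Tuple[int, float]]:
    """Return the k highest-scoring (passage index, score) pairs.

    Brute-force search: every passage vector is scored against the query.
    """
    scored = [(i, dot(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; a trained encoder would produce these.
passages = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.5, 0.5, 0.5],
]
query = [1.0, 0.0, 0.0]
top2 = retrieve(query, passages, k=2)
```

Passage vectors do not depend on the query, so they can be computed and indexed ahead of time; only the query has to be encoded at search time, which is what keeps first-stage latency low.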

Code & Models

Models

sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco (Hugging Face model, 4.9k downloads, 26 likes)
