Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT

Jon Saad-Falcon; Daniel Y. Fu; Simran Arora; Neel Guha; Christopher Ré

arXiv:2402.07440·cs.IR·November 19, 2024·1 cites

TL;DR

This paper introduces LoCoV1, a new benchmark for long-context retrieval, and presents M2-BERT, an efficient state-space encoder that significantly outperforms existing models on long document retrieval tasks.

Contribution

The paper develops a novel benchmark for evaluating long-context retrieval and proposes M2-BERT, a scalable, efficient encoder capable of handling documents up to 32K tokens.

Findings

1. M2-BERT outperforms Transformer-based models by at least 23.3 points on LoCoV1.

2. M2-BERT achieves comparable performance with up to 90x fewer parameters.

3. The proposed pretraining and fine-tuning strategies enable effective long-context retrieval.

Abstract

Retrieval pipelines, an integral component of many machine learning systems, perform poorly in domains where documents are long (e.g., 10K tokens or more) and where identifying the relevant document requires synthesizing information across the entire text. Developing long-context retrieval encoders suitable for these domains raises three challenges: (1) how to evaluate long-context retrieval performance, (2) how to pretrain a base language model to represent both short contexts (corresponding to queries) and long contexts (corresponding to documents), and (3) how to fine-tune this model for retrieval under the batch size limitations imposed by GPU memory constraints. To address these challenges, we first introduce LoCoV1, a novel 12-task benchmark constructed to measure long-context retrieval where chunking is not possible or not effective. We next present the M2-BERT retrieval encoder,…

Taxonomy

Topics: Image Retrieval and Classification Techniques · Recommender Systems and Techniques · Semantic Web and Ontologies

Methods: Balanced Selection