Improving language models by retrieving from trillions of tokens

Sebastian Borgeaud; Arthur Mensch; Jordan Hoffmann; Trevor Cai; Eliza Rutherford; Katie Millican; George van den Driessche; Jean-Baptiste Lespiau; Bogdan Damoc; Aidan Clark; Diego de Las Casas; Aurelia Guy; Jacob Menick; Roman Ring; Tom Hennigan; Saffron Huang; Loren Maggiore; Chris Jones; Albin Cassirer; Andy Brock; Michela Paganini; Geoffrey Irving; Oriol Vinyals; Simon Osindero; Karen Simonyan; Jack W. Rae; Erich Elsen; Laurent Sifre

arXiv:2112.04426 · cs.CL · February 9, 2022 · 297 citations



TL;DR

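This paper introduces RETRO, a retrieval-augmented language model that conditions on document chunks retrieved from a 2-trillion-token database. It performs comparably to GPT-3 and Jurassic-1 on the Pile with 25x fewer parameters, and after fine-tuning its performance carries over to knowledge-intensive downstream tasks such as question answering.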

Contribution

The paper presents a retrieval-augmented transformer architecture that combines a frozen BERT retriever, a differentiable encoder, and chunked cross-attention over a 2-trillion-token database, reaching performance comparable to much larger models with 25x fewer parameters.

Findings

01

RETRO achieves performance comparable to GPT-3 and Jurassic-1 on the Pile with 25x fewer parameters.

02

After fine-tuning, RETRO's performance carries over to knowledge-intensive downstream tasks such as question answering.

03

Pre-trained transformers can be rapidly retrofitted with retrieval ("RETROfit") and still achieve good performance.

Abstract

We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen BERT retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit…
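
The retrieval step described in the abstract (split the input into chunks, then fetch similar database chunks using the preceding tokens) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the bag-of-tokens `embed` is a toy stand-in for the frozen BERT retriever, the exhaustive similarity scan stands in for whatever nearest-neighbour index a trillion-token database would need, and the chunk size of 4 is arbitrary. It is not the authors' implementation.

```python
import math
from collections import Counter


def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def embed(chunk):
    """Toy stand-in for a frozen encoder: an L2-normalised bag-of-tokens vector."""
    counts = Counter(chunk)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(weight * b.get(tok, 0.0) for tok, weight in a.items())


def retrieve(query_chunk, database, k):
    """Return the k database chunks most similar to query_chunk (brute force)."""
    query = embed(query_chunk)
    ranked = sorted(database, key=lambda entry: cosine(query, embed(entry)), reverse=True)
    return ranked[:k]


def neighbours_per_chunk(tokens, database, chunk_size=4, k=2):
    """For each chunk, retrieve neighbours using only the preceding chunk.

    The first chunk has no preceding tokens, so it gets no neighbours.
    """
    chunks = split_into_chunks(tokens, chunk_size)
    neighbours = [[]]
    for previous in chunks[:-1]:
        neighbours.append(retrieve(previous, database, k))
    return chunks, neighbours


# Illustrative data only (hypothetical, not from the paper).
database = [
    ["the", "cat", "sat", "down"],
    ["stock", "prices", "fell", "today"],
    ["a", "cat", "sat", "there"],
]
tokens = "the cat sat on the mat and purred".split()
chunks, neighbours = neighbours_per_chunk(tokens, database, chunk_size=4, k=2)
```

In this sketch the second chunk is paired with the two database chunks that share the most tokens with the first chunk, so retrieval never looks at tokens the model has yet to predict. In a full model those neighbours would be encoded and consumed through cross-attention, which the sketch omits.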


Code & Models

Datasets

Agnuxo/Nebula
dataset · 17 downloads

Videos

[ML News] DeepMind builds Gopher | Google builds GLaM | Suicide capsule uses AI to check access· youtube

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · WordPiece