Improving language models by retrieving from trillions of tokens
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore

TL;DR
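This paper introduces RETRO, a retrieval-augmented language model that conditions on document chunks retrieved from a trillion-token database. It matches GPT-3 and Jurassic-1 on the Pile with 25× fewer parameters, and after fine-tuning its gains carry over to knowledge-intensive downstream tasks such as question answering.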
Contribution
The paper presents a retrieval-augmented transformer architecture that combines a frozen BERT retriever, a differentiable encoder, and chunked cross-attention over a trillion-token database, reaching the performance of much larger models with far fewer parameters.
Findings
RETRO achieves performance comparable to GPT-3 and Jurassic-1 on the Pile with 25× fewer parameters.
After fine-tuning, RETRO's performance carries over to downstream knowledge-intensive tasks such as question answering.
Pre-trained transformers can be rapidly retrofitted with retrieval ("RETROfit") and still achieve good performance.
Abstract
We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen BERT retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit…
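The retrieval step described in the abstract (split the input into chunks, then look up similar chunks in a database) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the chunk size, the function names, and above all the bag-of-words `embed` function, which is only a toy stand-in for the frozen BERT retriever. It covers retrieval only, not the differentiable encoder or the chunked cross-attention.

```python
import math
from collections import Counter

CHUNK_SIZE = 4  # hypothetical chunk length in tokens, chosen for the toy example


def split_into_chunks(tokens, chunk_size=CHUNK_SIZE):
    """Split a token sequence into consecutive fixed-size chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def embed(chunk):
    """Toy bag-of-words embedding (assumption: stands in for a frozen retriever)."""
    return Counter(chunk)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(chunk, database, k=2):
    """Return the k database chunks most similar to the query chunk."""
    query = embed(chunk)
    ranked = sorted(database, key=lambda entry: cosine(query, embed(entry)), reverse=True)
    return ranked[:k]


def retrieve_for_sequence(tokens, database, k=2):
    """Pair every chunk of the input with its retrieved neighbours."""
    return [(chunk, retrieve(chunk, database, k)) for chunk in split_into_chunks(tokens)]
```

A brute-force scan like this only works for a handful of entries; a database with on the order of a trillion tokens would need an approximate nearest-neighbour index instead, which this sketch does not attempt.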
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · WordPiece
