Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study

Boxin Wang; Wei Ping; Peng Xu; Lawrence McAfee; Zihan Liu; Mohammad Shoeybi; Yi Dong; Oleksii Kuchaiev; Bo Li; Chaowei Xiao; Anima Anandkumar; Bryan Catanzaro

arXiv:2304.06762 · cs.CL · December 22, 2023 · 1 citation

TL;DR

This paper conducts a comprehensive study on pretraining large autoregressive language models with retrieval, demonstrating improved text generation quality and downstream task performance compared to standard GPT, and introduces RETRO++ for enhanced question answering.

Contribution

It provides a scalable recipe for pretraining retrieval-augmented LMs like RETRO and introduces RETRO++, a variant that significantly improves open-domain QA results.

Findings

01. RETRO outperforms GPT in text generation quality and factual accuracy.

02. RETRO largely outperforms GPT on knowledge-intensive tasks.

03. RETRO++ significantly improves open-domain QA performance.

Abstract

Large decoder-only language models (LMs) can be largely improved in terms of perplexity by retrieval (e.g., RETRO), but its impact on text generation quality and downstream task accuracy is unclear. Thus, it is still an open question: shall we pretrain large autoregressive LMs with retrieval? To answer it, we perform a comprehensive study on a scalable pre-trained retrieval-augmented LM (i.e., RETRO) compared with standard GPT and retrieval-augmented GPT incorporated at fine-tuning or inference stages. We first provide the recipe to reproduce RETRO up to 9.5B parameters while retrieving a text corpus with 330B tokens. Based on that, we have the following novel findings: i) RETRO outperforms GPT on text generation with much less degeneration (i.e., repetition), moderately higher factual accuracy, and slightly lower toxicity with a nontoxic retrieval database. ii) On the LM Evaluation…

Code & Models

Repositories

NVIDIA/Megatron-LM
pytorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Residual Connection · Cosine Annealing · Softmax · Linear Layer · Byte Pair Encoding · Layer Normalization · Linear Warmup With Cosine Annealing · Dense Connections