InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining

Boxin Wang; Wei Ping; Lawrence McAfee; Peng Xu; Bo Li; Mohammad Shoeybi; Bryan Catanzaro

arXiv:2310.07713 · cs.CL · May 30, 2024 · 6 cites

TL;DR

This paper introduces Retro 48B, the largest language model pretrained with retrieval. It improves perplexity over its GPT counterpart and, after instruction tuning, improves zero-shot task performance, showing that retrieval augmentation remains effective at scale.

Contribution

The paper presents Retro 48B, the largest language model pretrained with retrieval, shows that it outperforms its GPT 43B counterpart in perplexity at a small extra compute cost, and finds that the decoder backbone alone yields comparable results, which simplifies the architecture.

Findings

1. Retro 48B outperforms GPT 43B in perplexity with only 2.58% additional GPU hours.
2. InstructRetro improves zero-shot task performance by 7-16% over GPT.
3. Using only the decoder backbone yields comparable results, simplifying the architecture.

Abstract

Pretraining auto-regressive large language models (LLMs) with retrieval demonstrates better perplexity and factual accuracy by leveraging external databases. However, the size of existing pretrained retrieval-augmented LLMs is still limited (e.g., Retro has 7.5B parameters), which limits the effectiveness of instruction tuning and zero-shot generalization. In this work, we introduce Retro 48B, the largest LLM pretrained with retrieval. Specifically, we continue to pretrain a 43B GPT model on an additional 100 billion tokens using the Retro augmentation method by retrieving from 1.2 trillion tokens. Notably, the obtained foundation model, Retro 48B, largely outperforms the counterpart GPT 43B trained on 1.2T tokens in terms of perplexity with only 2.58% additional GPU hours, demonstrating the significant scaling potential of the method. After instruction tuning on Retro, InstructRetro…
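The retrieval step that the abstract refers to can be pictured with a toy sketch: split the input into fixed-size chunks and, for each chunk, look up the most similar chunks in an external database. This is only an illustration under stated assumptions, not the paper's implementation. The bag-of-words `embed` function, whitespace tokenization, the chunk size of 4, and the three-entry `database` are all hypothetical stand-ins; a real system would use a learned encoder and an approximate nearest-neighbor index over a far larger datastore.

```python
import math
from collections import Counter


def embed(tokens):
    """Toy bag-of-words embedding (hypothetical stand-in for a learned encoder)."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def retrieve(query_chunk, database, k):
    """Return the k database chunks most similar to the query chunk."""
    q = embed(query_chunk)
    ranked = sorted(database, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical miniature retrieval database (the paper retrieves from 1.2T tokens).
database = [
    "retrieval augmented models read external text".split(),
    "the cat sat on the mat".split(),
    "language models predict the next token".split(),
]

tokens = "language models predict tokens and retrieval reads external text".split()
chunks = split_chunks(tokens, chunk_size=4)

# One list of neighbors per input chunk; a retrieval-augmented model would
# condition its predictions on these neighbors.
neighbors = [retrieve(chunk, database, k=1) for chunk in chunks]
```

In this toy run, the first chunk retrieves the "language models predict the next token" entry and the second retrieves the "retrieval augmented models read external text" entry, since those share the most words with each query chunk.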

Peer Reviews

Decision · ICML 2024 Poster

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

1. Proposal of the largest LLM pretrained with retrieval.
2. Good zero-shot question-answering capability.

Weaknesses

1. The model is only evaluated on QA tasks.
2. The paper would be stronger if it included results for retrieval-augmented LMs.
3. The paper could benefit from additional explanation or motivation of how retrieval-augmented training enhances the performance of LLMs. Could this improvement be attributed to potential data leakage during the training of Retro 48B, or to continued training on more data?

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

- The research highlights the benefits of continuing pretraining with retrieval mechanisms before proceeding to instruction tuning, a methodology that hasn't been extensively explored before.
- The paper brings to light the enhanced capability of the decoder in context incorporation for QA tasks when it's pretrained with retrieval, offering a fresh perspective on the potential of decoders in LLMs.
- The empirical results look nice. Retro 48B demonstrates enhanced perplexity performance when comp

Weaknesses

- The scalability, computational costs, and efficiency of training such models might be a concern.
- A more diverse set of metrics, especially some human evaluation, could provide a comprehensive understanding of the model's performance.

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 4

Strengths

- This paper introduces the largest scale of LM pre-trained with retrieval (RETRO-48B).
- They retrieve relevant chunks from a 1.2T token datastore, and by extensive quantization and efficiency techniques they make retrieval fast and scalable.
- They further instruction-tune RETRO-48B on diverse instruction-response pairs.

Weaknesses

I like this paper and believe it provides great technical contributions in terms of pre-training retrieval-augmented LMs at scale. On the other hand, I have several concerns, especially about the instruction tuning part and the downstream task evaluations. That being said, my concerns partially come from confusion caused by inconsistent descriptions in the paper, and I am happy to increase my score once I am convinced during the discussion period.

**1. Technical novelty** Introducing RE

Code & Models

Repositories

NVIDIA/Megatron-LM
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Residual Connection · Adam · Layer Normalization · Attention Dropout · Dense Connections