InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining
Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, Bryan Catanzaro

TL;DR
This paper introduces Retro 48B, the largest language model pretrained with retrieval. It improves perplexity over its GPT counterpart and, after instruction tuning (InstructRetro), improves zero-shot task performance, demonstrating the scalability and effectiveness of retrieval augmentation.
Contribution
The paper presents Retro 48B, the largest language model pretrained with retrieval, shows that it outperforms its GPT 43B counterpart in perplexity at a small additional compute cost, and finds that the architecture can be simplified to the decoder backbone alone.
Findings
Retro 48B outperforms GPT 43B in perplexity with only 2.58% additional GPU hours.
InstructRetro improves zero-shot task performance by 7-16% over GPT.
Using only the decoder backbone yields comparable results, which simplifies the architecture.
Abstract
Pretraining auto-regressive large language models (LLMs) with retrieval demonstrates better perplexity and factual accuracy by leveraging external databases. However, the size of existing pretrained retrieval-augmented LLMs is still limited (e.g., Retro has 7.5B parameters), which limits the effectiveness of instruction tuning and zero-shot generalization. In this work, we introduce Retro 48B, the largest LLM pretrained with retrieval. Specifically, we continue to pretrain a 43B GPT model on an additional 100 billion tokens using the Retro augmentation method by retrieving from 1.2 trillion tokens. Notably, the obtained foundation model, Retro 48B, largely outperforms the counterpart GPT 43B trained on 1.2T tokens in terms of perplexity with only 2.58% additional GPU hours, demonstrating the significant scaling potential of the method. After instruction tuning on Retro, InstructRetro…
Peer Reviews
Decision · ICML 2024 Poster
1. Proposal of the largest LLM pretrained with retrieval.
2. Good zero-shot question-answering capability.
1. The model is only evaluated on QA tasks.
2. The paper should include results for retrieval-augmented LMs.
3. The paper could benefit from additional explanation or motivation regarding how retrieval-augmented training enhances the performance of LLMs. Could this improvement be attributed to potential data leakage during the training of Retro 48B, or to continued training with more data?
- The research highlights the benefits of continuing pretraining with retrieval mechanisms before proceeding to instruction tuning, a methodology that hasn't been extensively explored before.
- The paper brings to light the enhanced capability of the decoder in context incorporation for QA tasks when it's pretrained with retrieval, offering a fresh perspective on the potential of decoders in LLMs.
- The empirical results look nice. Retro 48B demonstrates enhanced perplexity performance when comp
- The scalability, computational costs, and efficiency of training such models might be a concern.
- A more diverse set of metrics, especially some human evaluation, could provide a comprehensive understanding of the model's performance.
- This paper introduces the largest scale of LM pre-trained with retrieval (RETRO-48B).
- They retrieve relevant chunks from a 1.2T token datastore, and by extensive quantization and efficiency techniques they make retrieval fast and scalable.
- They further instruction-tune RETRO-48B on diverse instruction-response pairs.
I like this paper and believe this paper provides great technical contributions in terms of pre-training retrieval-augmented LM at scale. On the other hand, I have several concerns, especially for the instruction tuning part and their downstream task evaluations. That being said, my concerns partially come from confusion between inconsistent descriptions in the paper, and I am happy to increase my score once I am convinced during the discussion period. **1. Technical novelty** Introducing RE
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Residual Connection · Adam · Layer Normalization · Attention Dropout · Dense Connections
