GPT vs RETRO: Exploring the Intersection of Retrieval and Parameter-Efficient Fine-Tuning
Aleksander Ficek, Jiaqi Zeng, Oleksii Kuchaiev

TL;DR
This study applies parameter-efficient fine-tuning (PEFT) methods to retrieval-augmented RETRO and GPT models. RETRO performs better in zero-shot settings, GPT shows higher performance potential with PEFT, and 8B models give the best balance of cost and performance.
Contribution
First comprehensive comparison of PEFT methods applied to RAG-enhanced GPT and RETRO models across multiple sizes, highlighting their relative strengths and performance trade-offs.
Findings
RETRO outperforms GPT in zero-shot settings due to its unique pre-training process.
GPT models achieve higher performance with PEFT techniques.
8B models offer the best cost-performance balance.
Abstract
Parameter-Efficient Fine-Tuning (PEFT) and Retrieval-Augmented Generation (RAG) have become popular methods for adapting large language models while minimizing compute requirements. In this paper, we apply PEFT methods (P-tuning, Adapters, and LoRA) to a modified Retrieval-Enhanced Transformer (RETRO) and a baseline GPT model across several sizes, ranging from 823 million to 48 billion parameters. We show that RETRO models outperform GPT models in zero-shot settings due to their unique pre-training process, but GPT models have higher performance potential with PEFT. Additionally, our study indicates that 8B parameter models strike an optimal balance between cost and performance, and that P-tuning lags behind other PEFT techniques. We further provide a comparative analysis between applying PEFT to an Instruction-tuned RETRO model and base RETRO model. This work presents the first comprehensive…
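As a rough illustration of one of the PEFT methods named in the abstract, the sketch below shows the general idea behind LoRA: the pre-trained weight stays frozen and only a low-rank update is trained. It is not the paper's implementation. The class name `LoRALinear`, the helper functions, and the `rank` and `alpha` values are hypothetical choices made for this example, and plain Python lists stand in for tensors.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen pre-trained weight, shape (out_dim, in_dim)
        # Trainable factor A, shape (rank, in_dim), small random init (assumed).
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        # Trainable factor B, shape (out_dim, rank), zero init so the update starts at 0.
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        # x has shape (batch, in_dim); output is x W^T + scale * (x A^T) B^T.
        base = matmul(x, transpose(self.weight))
        update = matmul(matmul(x, transpose(self.A)), transpose(self.B))
        return [[b + self.scale * u for b, u in zip(b_row, u_row)]
                for b_row, u_row in zip(base, update)]

    def trainable_params(self):
        # Only A and B would be updated during fine-tuning.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


def demo():
    rng = random.Random(1)
    weight = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
    x = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(2)]
    layer = LoRALinear(weight, rank=2)
    base_output = matmul(x, transpose(weight))
    full_params = len(weight) * len(weight[0])
    return layer.forward(x) == base_output, layer.trainable_params(), full_params
```

Because `B` starts at zero, the layer initially reproduces the frozen model's output exactly, and fine-tuning only has to learn the two small factors rather than the full weight matrix.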
Taxonomy
Topics: Machine Learning and Algorithms · Algorithms and Data Compression
Methods: Attention Is All You Need · WordPiece · Cosine Annealing · Linear Layer · Attention Dropout · Linear Warmup With Linear Decay · BART · Weight Decay · BERT
