Studying the Role of Input-Neighbor Overlap in Retrieval-Augmented Language Models Training Efficiency
Ehsan Doostmohammadi, Marco Kuhlmann

TL;DR
This study explores how query-context overlap influences retrieval-augmented language model training, revealing that increased overlap improves efficiency and performance, especially when artificially enhanced through synthetic paraphrased context.
Contribution
The paper systematically analyzes the impact of input-neighbor overlap on model training and demonstrates that synthetic context can improve data efficiency and reduce training time by approximately 40%.
Findings
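Above a critical threshold, increased overlap substantially improves test perplexity and learning speed; below it, the effect is minimal
Synthetic paraphrased context enhances data efficiency and reduces training time by ~40% without compromising performance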
Benefits extend to question-answering tasks
Abstract
Retrieval-augmented language models have demonstrated performance comparable to much larger models while requiring fewer computational resources. The effectiveness of these models crucially depends on the overlap between query and retrieved context, but the optimal degree of this overlap remains unexplored. In this paper, we systematically investigate how varying levels of query-context overlap affect model performance during both training and inference. Our experiments reveal that increased overlap initially has minimal effect, but substantially improves test-time perplexity and accelerates model learning above a critical threshold. Building on these findings, we demonstrate that deliberately increasing overlap through synthetic context can enhance data efficiency and reduce training time by approximately 40% without compromising performance. We specifically generate synthetic…
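To make the notion of query-context overlap concrete, the following is a minimal sketch that measures it as the fraction of distinct query tokens that also appear in a retrieved neighbor. The whitespace tokenizer, the unigram definition, and the function name `token_overlap` are assumptions chosen for illustration; this summary does not specify the overlap measure the paper actually uses.

```python
def token_overlap(query: str, neighbor: str) -> float:
    """Fraction of distinct query tokens that also occur in the neighbor.

    Assumption: lowercase whitespace tokenization and unigram matching,
    which is only one of many possible ways to define overlap.
    """
    query_tokens = set(query.lower().split())
    neighbor_tokens = set(neighbor.lower().split())
    if not query_tokens:
        # An empty query has no tokens to match, so report zero overlap.
        return 0.0
    return len(query_tokens & neighbor_tokens) / len(query_tokens)


query = "the cat sat on the mat"
neighbor = "a cat was sitting on a mat"
print(token_overlap(query, neighbor))  # 3 of 5 distinct query tokens match
```

Under this definition, a paraphrased neighbor would typically share more tokens with the query than an unrelated passage, so it would score higher; other definitions (n-gram or subword based) may behave differently.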
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
