Studying the Role of Input-Neighbor Overlap in Retrieval-Augmented Language Models Training Efficiency

Ehsan Doostmohammadi; Marco Kuhlmann

arXiv:2505.14309·cs.CL·May 21, 2025

TL;DR

This study explores how query-context overlap influences retrieval-augmented language model training, revealing that increased overlap improves efficiency and performance, especially when artificially enhanced through synthetic paraphrased context.

Contribution

The paper systematically analyzes the impact of input-neighbor overlap on model training and demonstrates that synthetic context can significantly boost data efficiency and reduce training time.

Findings

01. Overlap improves test perplexity and learning speed above a critical threshold

02. Synthetic paraphrased context improves data efficiency and reduces training time by ~40%

03. Benefits extend to question-answering tasks

Abstract

Retrieval-augmented language models have demonstrated performance comparable to much larger models while requiring fewer computational resources. The effectiveness of these models crucially depends on the overlap between query and retrieved context, but the optimal degree of this overlap remains unexplored. In this paper, we systematically investigate how varying levels of query-context overlap affect model performance during both training and inference. Our experiments reveal that increased overlap initially has minimal effect, but substantially improves test-time perplexity and accelerates model learning above a critical threshold. Building on these findings, we demonstrate that deliberately increasing overlap through synthetic context can enhance data efficiency and reduce training time by approximately 40% without compromising performance. We specifically generate synthetic…
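The abstract as excerpted here does not say how query-context overlap is measured. As a minimal sketch only, assuming whitespace tokenization and a clipped unigram overlap (the fraction of query tokens that also occur in the retrieved neighbor), the quantity could be estimated as follows. The function name `token_overlap` and the example strings are hypothetical and are not taken from the paper.

```python
from collections import Counter


def token_overlap(query: str, context: str) -> float:
    """Fraction of query tokens also found in the context.

    Assumption: lowercase whitespace tokens and clipped counts; this is an
    illustrative stand-in, not the overlap metric used in the paper.
    """
    query_counts = Counter(query.lower().split())
    context_counts = Counter(context.lower().split())
    total = sum(query_counts.values())
    if total == 0:
        return 0.0
    # Clip each token's count by how often it appears in the context.
    shared = sum(min(n, context_counts[tok]) for tok, n in query_counts.items())
    return shared / total


# Hypothetical example: a retrieved neighbor versus a close paraphrase.
query = "the cat sat on the mat"
neighbor = "a cat sat on a mat"
paraphrase = "the cat sat on the rug"

neighbor_overlap = token_overlap(query, neighbor)
paraphrase_overlap = token_overlap(query, paraphrase)
```

Under these assumptions, a paraphrased context that reuses more of the query's wording scores higher than the looser neighbor, which is the kind of deliberately increased overlap the abstract describes.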

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques