From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data
Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos

TL;DR
This paper introduces a finetuning method using synthetic data to enhance LLMs' retrieval and reasoning abilities in long-context scenarios, demonstrating significant improvements without degrading performance on general benchmarks.
Contribution
The study presents a novel synthetic dataset for finetuning LLMs, significantly improving long-context retrieval and reasoning capabilities while maintaining overall benchmark performance.
Findings
10.5% improvement on MDQA with 20 documents at position 10 for GPT-3.5 Turbo
Finetuning on synthetic data does not cause hallucinations or performance drops on benchmarks like TriviaQA
Synthetic data-based finetuning enhances long-context task performance without harming general abilities.
Abstract
Recent studies have shown that Large Language Models (LLMs) struggle to accurately retrieve information and maintain reasoning capabilities when processing long-context inputs. To address these limitations, we propose a finetuning approach utilizing a carefully designed synthetic dataset comprising numerical key-value retrieval tasks. Our experiments on models like GPT-3.5 Turbo and Mistral 7B demonstrate that finetuning LLMs on this dataset significantly improves LLMs' information retrieval and reasoning capabilities in longer-context settings. We present an analysis of the finetuned models, illustrating the transfer of skills from synthetic to real task evaluations (e.g., 10.5% improvement on 20-document MDQA at position 10 for GPT-3.5 Turbo). We also find that finetuned LLMs' performance on general benchmarks remains almost constant while LLMs finetuned on other baseline…
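To make the phrase "numerical key-value retrieval tasks" concrete, here is a minimal sketch of how one such synthetic example could be generated. The prompt wording, the function name `make_kv_retrieval_example`, and all parameters are assumptions for illustration; the abstract does not specify the paper's actual data format.

```python
import random


def make_kv_retrieval_example(num_dicts=5, keys_per_dict=4, max_int=10_000, seed=None):
    """Build one hypothetical numerical key-value retrieval example.

    The layout (several small dictionaries of integers followed by a
    question about one key) is an assumed format, not the paper's.
    """
    rng = random.Random(seed)

    # Draw globally unique keys so the queried key occurs in exactly one dictionary.
    keys = rng.sample(range(max_int), num_dicts * keys_per_dict)
    dicts = []
    for i in range(num_dicts):
        chunk = keys[i * keys_per_dict:(i + 1) * keys_per_dict]
        dicts.append({k: rng.randrange(max_int) for k in chunk})

    # Pick the dictionary (context position) and the key to ask about.
    gold_index = rng.randrange(num_dicts)
    gold_key = rng.choice(list(dicts[gold_index]))
    gold_value = dicts[gold_index][gold_key]

    lines = [f"Dictionary [{i}] {d}" for i, d in enumerate(dicts)]
    prompt = "\n".join(lines) + f"\nReport the value of key {gold_key}."
    return {"prompt": prompt, "answer": str(gold_value), "position": gold_index}


if __name__ == "__main__":
    example = make_kv_retrieval_example(seed=0)
    print(example["prompt"])
    print("Expected answer:", example["answer"])
```

Varying `num_dicts` would lengthen the context, and recording `position` makes it possible to check whether accuracy depends on where the target sits in the context, which is the kind of position-dependent effect the MDQA result refers to.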
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Residual Connection · Multi-Head Attention · Weight Decay · Softmax · Layer Normalization
