Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval

Sheryl Hsu; Omar Khattab; Chelsea Finn; Archit Sharma

arXiv:2410.23214 · cs.LG · November 1, 2024 · 2 cites

Open Access · 3 Reviews

TL;DR

LeReT is a reinforcement learning framework that trains LLMs to generate better search queries by trying different options and learning from successful results, significantly improving retrieval accuracy and answer quality.

Contribution

Introduces LeReT, a novel RL-based method enabling LLMs to learn effective search queries through trial and error, enhancing retrieval and grounding performance.

Findings

01. Up to 29% absolute improvement in retrieval accuracy
02. Up to 17% improvement in downstream generator evaluations
03. Applicable to various off-the-shelf retrievers

Abstract

The hallucinations of large language models (LLMs) are increasingly mitigated by allowing LLMs to search for information and to ground their answers in real sources. Unfortunately, LLMs often struggle with posing the right search queries, especially when dealing with complex or otherwise indirect topics. Observing that LLMs can learn to search for relevant facts by trying different queries and learning to up-weight queries that successfully produce relevant results, we introduce Learning to Retrieve by Trying (LeReT), a reinforcement learning framework that explores search queries and uses preference-based optimization to improve their quality. LeReT can improve the absolute retrieval accuracy by up to 29% and the downstream generator evaluations by 17%. The simplicity and flexibility of LeReT allow it to be applied to arbitrary…
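The "try queries, up-weight the ones that work" idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the recall-based reward, the `margin` parameter, the function names, and the toy retriever are all assumptions made for illustration. The sketch scores several candidate queries by how many labeled relevant documents they retrieve, then turns the scores into (chosen, rejected) pairs of the kind a preference-based optimizer could consume.

```python
from itertools import combinations


def recall(retrieved, relevant):
    """Fraction of the labeled relevant documents found in the retrieved set."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


def build_preference_pairs(candidates, relevant, search, margin=0.0):
    """Score each candidate query and emit (chosen, rejected) query pairs.

    `search` is any callable mapping a query string to a list of document
    ids (hypothetical interface). Pairs whose rewards differ by no more
    than `margin` are skipped, since they carry no clear preference.
    """
    scored = [(q, recall(search(q), relevant)) for q in candidates]
    pairs = []
    for (q_a, r_a), (q_b, r_b) in combinations(scored, 2):
        if abs(r_a - r_b) > margin:
            chosen, rejected = (q_a, q_b) if r_a > r_b else (q_b, q_a)
            pairs.append((chosen, rejected))
    return pairs


# Toy stand-in for a real retriever (assumption, for illustration only).
TOY_INDEX = {
    "who directed Inception": ["doc_nolan", "doc_inception"],
    "Inception movie": ["doc_inception"],
    "famous directors": ["doc_spielberg"],
}


def toy_search(query):
    return TOY_INDEX.get(query, [])


pairs = build_preference_pairs(
    candidates=list(TOY_INDEX),
    relevant=["doc_nolan", "doc_inception"],
    search=toy_search,
)
for chosen, rejected in pairs:
    print(f"prefer {chosen!r} over {rejected!r}")
```

In a real pipeline the candidates would be sampled from the LLM and the pairs passed to a preference-optimization objective; both steps are outside the scope of this sketch.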

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

* Well-written paper, clear presentation
* Results are strong and convincingly support the claim that using RL to improve the query generator works better than pure SFT.

Weaknesses

* A bit difficult to judge the novelty of the contribution. Have similar methods been used for single-hop QA or RAG in general? If so, the novelty here might be marginal, especially since the paper relies on direct supervision for the reward model.
* The two multi-hop datasets used are not natural and somewhat out-of-date. It would be great to see the method's usefulness on more relevant tasks and benchmarks. The long-form generation attempt in the appendix is interesting, and could perhaps be

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Introduces a unique reinforcement learning framework to improve retrieval accuracy in LLMs, especially for complex multi-hop queries.
2. Demonstrates the effectiveness of iterative training in enhancing the retrieval and grounding abilities of LLMs.
3. Compatible with various retrieval systems, including ColBERTv2 and Azure AI Search, indicating its broad applicability.

Weaknesses

1. Primarily relies on direct supervision for labeling relevant documents, which may limit its scalability in cases where explicit relevance labels are unavailable.
2. Requires extensive computation due to multi-hop retrieval and diverse query sampling, making it resource-intensive.
3. The need for sampling across multiple hops is computationally intensive and less parallelizable, reducing scalability.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- LeReT is applicable to general retrieval-augmented generation (RAG) systems and can adapt to different retrievers.
- LeReT significantly improves retrieval accuracy. Compared to the unadapted Llama and Gemma instruction models, the recall rate increases by 9-22% on HotPotQA and 27-29% on HoVer.
- It can be used iteratively: applying LeReT for two iterations shows that the model performance after the second iteration is better than that of the standard non-iterative LeReT.

Weaknesses

- The novelty assertion of the proposed method lacks clarity. Regarding the related work spanning from line 139 to 148, it remains ambiguous as to how the proposed method differentiates itself from those other methods. I comprehend that the proposed approach employs diverse query generation and IPO for preference learning. However, these seem to be more of incremental enhancements within an existing framework rather than representing a distinct novelty.
- Also, the experiments lack comparisons w

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Semantic Web and Ontologies