Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu, Xinyu Dai, Rui Xia

TL;DR
This paper presents RRPO, a reinforcement learning framework that directly optimizes rerankers for LLM generation quality, reducing reliance on human relevance labels and improving retrieval-augmented generation performance.
Contribution
It introduces a novel RL-based reranking approach aligned with LLM utility, enhancing retrieval results for better downstream generation without extensive human annotations.
Findings
RRPO outperforms existing rerankers such as RankZephyr on knowledge-intensive benchmarks.
The framework generalizes to different LLMs such as GPT-4o.
It remains robust even with noisy supervision signals.
Abstract
Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized in isolation on static, human-annotated relevance labels, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often fail to provide the actual utility required by the LLM for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes for context utility using LLM feedback, thereby eliminating the need for expensive human annotations. To ensure training stability, we further introduce a…
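The abstract is truncated here and does not spell out RRPO's objective, so the following is only a minimal sketch of the general idea it describes: treating reranking as a sequence of pick-one-document decisions and updating the reranker from a downstream reward rather than from relevance labels. It uses a plain REINFORCE update over a Plackett-Luce sampling policy, which is a swapped-in textbook technique and not necessarily the authors' method. All names (`sample_ranking`, `toy_llm_reward`, `train`) are hypothetical, and `toy_llm_reward` is a stand-in for real LLM feedback on generation quality.

```python
import math
import random


def sample_ranking(scores, rng):
    """Sample a ranking one document at a time (Plackett-Luce policy).

    Returns the ranking, its log-probability, and the gradient of that
    log-probability with respect to each document score.
    """
    remaining = list(range(len(scores)))
    ranking, log_prob = [], 0.0
    grad = [0.0] * len(scores)
    while remaining:
        # Softmax over the documents that have not been placed yet.
        m = max(scores[i] for i in remaining)
        weights = [math.exp(scores[i] - m) for i in remaining]
        total = sum(weights)
        pick = rng.choices(range(len(remaining)), weights=weights)[0]
        log_prob += math.log(weights[pick] / total)
        # d/ds_j log softmax = 1[j == picked] - softmax_j
        for pos, doc in enumerate(remaining):
            grad[doc] -= weights[pos] / total
        grad[remaining[pick]] += 1.0
        ranking.append(remaining.pop(pick))
    return ranking, log_prob, grad


def toy_llm_reward(ranking, useful_docs, k=2):
    """Hypothetical stand-in for LLM feedback.

    A real setup would score the LLM's answer given the top-k context;
    here the reward is just the fraction of top-k documents that are useful.
    """
    return sum(1 for d in ranking[:k] if d in useful_docs) / k


def train(num_docs=5, useful_docs=frozenset({3, 4}), steps=500, lr=0.5, seed=0):
    """REINFORCE with a moving-average baseline (assumed for variance reduction)."""
    rng = random.Random(seed)
    scores = [0.0] * num_docs
    baseline = 0.0
    for _ in range(steps):
        ranking, _, grad = sample_ranking(scores, rng)
        reward = toy_llm_reward(ranking, useful_docs)
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        scores = [s + lr * advantage * g for s, g in zip(scores, grad)]
    return scores


if __name__ == "__main__":
    print(train())
```

In this toy setting the scores are free parameters per document; in a real reranker they would come from a model conditioned on the query and each document, and the reward would require an LLM call per sampled ranking.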
