F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking
Rohan Surana, Gagan Mundada, Junda Wu, Xintong Li, Yizhu Jiao, Bowen Jin, Sizhe Zhou, Tong Yu, Ritwik Sinha, Jiawei Han, Jingbo Shang, Julian McAuley

TL;DR
This paper introduces F-GRPO, a unified framework that jointly optimizes candidate generation and ranking in LLM-based retrieval, improving performance over decoupled methods.
Contribution
The paper proposes a factorized policy optimization method that trains generation and ranking together within a single LLM backbone, addressing the credit assignment gap between the two stages.
Findings
F-GRPO outperforms decoupled baselines on recommendation and QA tasks.
It surpasses supervised methods and is competitive with zero-shot rerankers.
No architectural changes are needed at inference time.
Abstract
Traditional retrieval pipelines optimize utility through stages of candidate retrieval and reranking, where ranking operates over a predefined candidate set. Large Language Models (LLMs) broaden this into a generative process: given a candidate pool, an LLM can generate a subset and order it within a single autoregressive pass. However, this flexibility introduces a new optimization challenge: the model must search a combinatorial output space while receiving utility feedback only after the full ranked list is generated. Because this feedback is defined over the completed sequence, it cannot distinguish whether a poor result arises from failing to generate a relevant subset or from failing to rank that subset correctly. This credit assignment gap makes end-to-end optimization unstable and sample-inefficient. Existing systems often address this by separating candidate generation from…
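The credit assignment gap described in the abstract can be illustrated with a small sketch. The code below is not the paper's method: the helper names (`ndcg`, `subset_recall`, `ordering_quality`, `group_relative`), the binary relevance labels, and the toy candidate pool are all assumptions made for illustration. It shows how a single sequence-level score can look similar for two outputs that fail for different reasons, and how a standard group-relative normalization (reward minus group mean, divided by group standard deviation, as used in GRPO-style training) could be applied to each factor separately.

```python
import math
from statistics import mean, pstdev


def dcg(gains):
    """Discounted cumulative gain for a list of per-position gains."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(ranked, relevant, k):
    """Sequence-level utility: NDCG@k of the full generated list."""
    gains = [1.0 if item in relevant else 0.0 for item in ranked[:k]]
    ideal = dcg([1.0] * min(len(relevant), k))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def subset_recall(ranked, relevant):
    """Hypothetical generation signal: share of relevant items selected."""
    return len(set(ranked) & relevant) / len(relevant) if relevant else 0.0


def ordering_quality(ranked, relevant):
    """Hypothetical ranking signal: DCG relative to the best possible
    reordering of the *same* generated items."""
    gains = [1.0 if item in relevant else 0.0 for item in ranked]
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def group_relative(rewards, eps=1e-8):
    """Normalize rewards within a sampled group (mean 0, unit scale)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy setup (assumed): two relevant items in the candidate pool.
relevant = {"a", "b"}
outputs = [
    ["a", "c", "d", "e"],  # misses "b", but orders what it selected well
    ["c", "d", "a", "b"],  # selects both relevant items, but ranks them last
]

overall = [ndcg(o, relevant, k=4) for o in outputs]
gen_rewards = [subset_recall(o, relevant) for o in outputs]
rank_rewards = [ordering_quality(o, relevant) for o in outputs]

# One advantage per factor instead of a single sequence-level advantage.
gen_adv = group_relative(gen_rewards)
rank_adv = group_relative(rank_rewards)

for o, u, g, r in zip(outputs, overall, gen_rewards, rank_rewards):
    print(o, f"overall={u:.3f} generation={g:.2f} ranking={r:.3f}")
```

In this toy case the two overall scores are close (roughly 0.61 and 0.57), yet the first output fails at generation and the second at ranking. Separating the two signals makes that difference visible to the optimizer; how F-GRPO actually defines and combines its factors is not specified in the text above.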
