Answer First, Reason Later: Aligning Search Relevance via Mode-Balanced Reinforcement Learning
Shijie Zhang, Xiang Guo, Rujun Guo, Shaoyu Liu, Xiaozhao Wang, Guanjun Jiang, Kevin Zhang

TL;DR
This paper introduces the Answer-First, Reason Later (AFRL) paradigm for search relevance, in which the model emits the relevance score first and an interpretable explanation afterwards, and proposes a mode-balanced RL training method to improve performance and stability.
Contribution
The paper presents the AFRL framework, which balances reinforcement learning and supervised fine-tuning so that search relevance models keep interpretable reasoning while answering quickly.
Findings
Achieves state-of-the-art performance with a 32B model.
Enables knowledge distillation to smaller models.
Balances mode-seeking and mode-covering in training.
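As background for the last finding, here is a minimal sketch of what "mode-seeking" and "mode-covering" mean in their standard usage: minimizing reverse KL, KL(q || p), tends to favor a model q that concentrates on one mode of the target p, while minimizing forward KL, KL(p || q), tends to favor a q that spreads mass over all modes. The three distributions below are made-up toy numbers, and how the paper maps these two behaviors onto RL and SFT is not shown in this excerpt.

```python
import math
from typing import Sequence


def kl(a: Sequence[float], b: Sequence[float]) -> float:
    """KL(a || b) for discrete distributions with strictly positive entries."""
    return sum(x * math.log(x / y) for x, y in zip(a, b))


# Toy target with two modes (outcomes 0 and 2); values are illustrative only.
target = [0.495, 0.01, 0.495]
# Candidate that commits to a single mode of the target.
narrow = [0.98, 0.01, 0.01]
# Candidate that spreads mass evenly over all outcomes.
broad = [1 / 3, 1 / 3, 1 / 3]

# Reverse KL, KL(q || target): the mode-seeking objective.
reverse_narrow = kl(narrow, target)
reverse_broad = kl(broad, target)

# Forward KL, KL(target || q): the mode-covering objective.
forward_narrow = kl(target, narrow)
forward_broad = kl(target, broad)

print(f"reverse KL: narrow={reverse_narrow:.3f} broad={reverse_broad:.3f}")
print(f"forward KL: narrow={forward_narrow:.3f} broad={forward_broad:.3f}")
```

On this toy target, reverse KL scores the single-mode candidate better and forward KL scores the spread-out candidate better, which is the tension a mode-balanced objective has to manage.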
Abstract
Building a search relevance model that achieves both low latency and high performance is a long-standing challenge in the search industry. To satisfy the millisecond-level response requirements of online systems while retaining the interpretable reasoning traces of Large Language Models (LLMs), we propose a novel Answer-First, Reason Later (AFRL) paradigm. This paradigm requires the model to output the definitive relevance score in the very first token, followed by a structured logical explanation. Inspired by the success of reasoning models, we adopt a "Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL)" pipeline to achieve AFRL. However, directly applying existing RL training often leads to mode collapse in the search relevance task, where the model forgets complex long-tail rules in pursuit of high rewards. From an information theory perspective: RL…
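The answer-first output format described in the abstract can be illustrated with a small sketch of how a serving path might consume such a stream: read the first token as the relevance score and leave the explanation unread until someone asks for it. Everything here is an assumption for illustration: the token sequence, the integer score scale, and the helper names `fake_model_stream` and `read_answer_first` are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Iterator, List, Tuple

# Records which tokens were actually pulled from the (simulated) model stream.
consumed: List[str] = []


def fake_model_stream() -> Iterator[str]:
    """Hypothetical AFRL-style output: score token first, explanation after."""
    for tok in ["2", "\n", "Query", " intent", " matches", " the", " title", "."]:
        consumed.append(tok)
        yield tok


def read_answer_first(tokens: Iterable[str]) -> Tuple[int, Iterator[str]]:
    """Return the first token parsed as the score, plus a lazy iterator
    over the remaining explanation tokens."""
    stream = iter(tokens)
    first = next(stream)
    return int(first), stream


score, rest = read_answer_first(fake_model_stream())

# Only one token has been generated so far; a latency-sensitive caller
# could return the score here and stop decoding.
tokens_before_explanation = len(consumed)

# An offline or debugging caller can keep reading to get the explanation.
explanation = "".join(rest).strip()

print(score, tokens_before_explanation, explanation)
```

Because the generator is lazy, the score is available after a single decoding step, while the explanation costs extra tokens only for callers that choose to read it.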
Taxonomy
Topics: Information Retrieval and Search Behavior · Expert finding and Q&A systems · Topic Modeling
