Answer First, Reason Later: Aligning Search Relevance via Mode-Balanced Reinforcement Learning

Shijie Zhang; Xiang Guo; Rujun Guo; Shaoyu Liu; Xiaozhao Wang; Guanjun Jiang; Kevin Zhang

arXiv:2602.10006 · cs.LG · February 11, 2026

TL;DR

This paper introduces the Answer-First, Reason Later (AFRL) paradigm for search relevance, in which the model emits the relevance score first and an interpretable explanation afterward, and proposes a mode-balanced RL training method to improve performance and stability.

Contribution

It presents the AFRL paradigm together with a training method that balances reinforcement learning and supervised fine-tuning, so that search relevance models keep interpretable reasoning while improving performance.

Findings

01 Achieves state-of-the-art performance with a 32B model.

02 Enables knowledge distillation to smaller models.

03 Balances mode-seeking and mode-covering in training.

Abstract

Building a search relevance model that achieves both low latency and high performance is a long-standing challenge in the search industry. To satisfy the millisecond-level response requirements of online systems while retaining the interpretable reasoning traces of Large Language Models (LLMs), we propose a novel Answer-First, Reason Later (AFRL) paradigm. This paradigm requires the model to output the definitive relevance score in the very first token, followed by a structured logical explanation. Inspired by the success of reasoning models, we adopt a "Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL)" pipeline to achieve AFRL. However, directly applying existing RL training often leads to mode collapse in the search relevance task, where the model forgets complex long-tail rules in pursuit of high rewards. From an information theory perspective: RL…
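The answer-first output described in the abstract can be illustrated with a small parsing sketch. This is not the authors' implementation: the function name `parse_afrl_output`, the 0 to 3 label set, and the whitespace-separated text format are all assumptions made for illustration. It only shows why a score in the first position lets a caller read the decision without consuming the explanation.

```python
from typing import NamedTuple


class AfrlOutput(NamedTuple):
    score: int        # relevance label read from the first token
    explanation: str  # reasoning text that follows the score


def parse_afrl_output(text: str, valid_scores=frozenset({0, 1, 2, 3})) -> AfrlOutput:
    """Split a hypothetical answer-first response into score and explanation.

    Assumes (not from the paper) that the score is a single integer token
    separated from the explanation by whitespace, and that labels are 0..3.
    """
    parts = text.strip().split(maxsplit=1)
    if not parts:
        raise ValueError("empty model output")
    first = parts[0]
    if not first.isdecimal() or int(first) not in valid_scores:
        raise ValueError(f"first token is not a valid score: {first!r}")
    explanation = parts[1].strip() if len(parts) > 1 else ""
    return AfrlOutput(int(first), explanation)


# Hypothetical model output: score first, explanation second.
example = parse_afrl_output(
    "2 The document covers the product but not the requested size."
)
print(example.score)
```

In a latency-sensitive setting, a caller could stop after the first token and use only `score`, requesting the explanation text only when a trace is needed for inspection.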

Taxonomy

Topics: Information Retrieval and Search Behavior · Expert finding and Q&A systems · Topic Modeling