Token-level Proximal Policy Optimization for Query Generation
Yichen Ouyang, Lu Wang, Fangkai Yang, Pu Zhao, Chenghua Huang, Jianfeng Liu, Bochen Pang, Yaming Yang, Yuefeng Zhan, Hao Sun, Qingwei Lin, Saravan Rajmohan, Weiwei Deng, Dongmei Zhang, Feng Sun, Qi Zhang

TL;DR
This paper introduces TPPO, a novel reinforcement learning method that fine-tunes large language models at the token level to generate higher-quality search queries by better inferring user intent.
Contribution
The paper presents TPPO, a token-level reinforcement learning approach that enhances LLMs for query generation, addressing sparse rewards and improving performance over existing methods.
Findings
TPPO significantly outperforms existing query generation methods.
It improves the quality and relevance of generated queries.
Experimental results on open-source and industrial datasets validate its effectiveness.
Abstract
Query generation is a critical task for web search engines (e.g., Google, Bing) and recommendation systems. Recently, state-of-the-art query generation methods leverage Large Language Models (LLMs) for their strong capabilities in context understanding and text generation. However, they still face challenges in generating high-quality queries, particularly in inferring user intent from a user's web search interaction history. In this paper, we propose Token-level Proximal Policy Optimization (TPPO), a novel approach designed to empower LLMs to perform better in query generation through fine-tuning. TPPO is based on the Reinforcement Learning from AI Feedback (RLAIF) paradigm and consists of a token-level reward model and a token-level proximal policy optimization module, which together address the sparse reward challenge in traditional RLAIF frameworks. To evaluate the effectiveness and robustness of TPPO,…
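The sparse reward challenge mentioned in the abstract can be illustrated with a small sketch. This is not the paper's implementation: it assumes a standard GAE-style advantage estimate and the usual PPO clipped objective, and the function names (`gae_advantages`, `ppo_clip_token_loss`), the reward values, and the zero critic values are all hypothetical. It contrasts a sequence-level reward that arrives only on the last token with a reward assigned to every token.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation over one generated query.

    rewards[t] is the reward for emitting token t; values[t] is a critic's
    estimate before emitting token t. The value after the last token is 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clip_token_loss(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate loss for a single token.

    ratio is pi_new(token) / pi_old(token) for that token.
    """
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return -min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical 4-token query with an untrained critic (all values 0).
values = [0.0, 0.0, 0.0, 0.0]

# Sequence-level (sparse) feedback: only the final token carries a reward.
sparse_rewards = [0.0, 0.0, 0.0, 1.0]

# Token-level (dense) feedback: made-up per-token scores, including a
# negative score for the second token.
token_rewards = [0.4, -0.2, 0.3, 0.5]

sparse_adv = gae_advantages(sparse_rewards, values)
token_adv = gae_advantages(token_rewards, values)

for t, (s, d) in enumerate(zip(sparse_adv, token_adv)):
    print(f"token {t}: sparse advantage={s:.4f}, token-level advantage={d:.4f}")
```

In the sparse case, every token's advantage is just a discounted copy of the single final reward, so the signal cannot distinguish a helpful token from a harmful one. With per-token rewards, each token's TD error reflects its own score, which is the kind of finer-grained credit assignment a token-level reward model is meant to provide.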
Taxonomy
Topics: Distributed systems and fault tolerance · Cryptography and Data Security · Optimization and Search Problems
