Token-level Proximal Policy Optimization for Query Generation

Yichen Ouyang; Lu Wang; Fangkai Yang; Pu Zhao; Chenghua Huang; Jianfeng Liu; Bochen Pang; Yaming Yang; Yuefeng Zhan; Hao Sun; Qingwei Lin; Saravan Rajmohan; Weiwei Deng; Dongmei Zhang; Feng Sun; Qi Zhang

arXiv:2411.00722·cs.LG·November 4, 2024

TL;DR

This paper introduces TPPO, a novel reinforcement learning method that fine-tunes large language models at the token level to generate higher-quality search queries by better inferring user intent.

Contribution

The paper presents TPPO, a token-level reinforcement learning approach that enhances LLMs for query generation, addressing sparse rewards and improving performance over existing methods.

Findings

01

TPPO significantly outperforms existing query generation methods.

02

It improves the quality and relevance of generated queries.

03

Experimental results on open-source and industrial datasets validate its effectiveness.

Abstract

Query generation is a critical task for web search engines (e.g., Google, Bing) and recommendation systems. Recently, state-of-the-art query generation methods leverage Large Language Models (LLMs) for their strong capabilities in context understanding and text generation. However, they still face challenges in generating high-quality queries in terms of inferring user intent based on their web search interaction history. In this paper, we propose Token-level Proximal Policy Optimization (TPPO), a novel approach designed to empower LLMs to perform better in query generation through fine-tuning. TPPO is based on the Reinforcement Learning from AI Feedback (RLAIF) paradigm, consisting of a token-level reward model and a token-level proximal policy optimization module to address the sparse reward challenge in traditional RLAIF frameworks. To evaluate the effectiveness and robustness of TPPO,…
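
The sparse reward challenge mentioned in the abstract can be illustrated with a small sketch. The abstract does not specify TPPO's reward model or objective, so the code below is not the authors' method: it pairs a generic return-to-go computation with the standard PPO clipped surrogate, applied per token. The function names, reward values, and the choice to use raw returns as advantages (no value baseline) are all assumptions made for illustration.

```python
import math

def discounted_returns(rewards, gamma=1.0):
    """Return-to-go at each token position (hypothetical helper)."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

def clipped_token_loss(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate loss, averaged over tokens."""
    losses = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                      # pi_new / pi_old for this token
        clipped = max(min(ratio, 1 + eps), 1 - eps)    # clip ratio to [1 - eps, 1 + eps]
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)

# Made-up rewards for a 4-token generated query.
sparse_rewards = [0.0, 0.0, 0.0, 1.0]   # single score at the end of the sequence
dense_rewards = [0.5, -0.2, 0.3, 0.4]   # one score per token

sparse_returns = discounted_returns(sparse_rewards)
dense_returns = discounted_returns(dense_rewards)

# Assumption: raw returns stand in for advantages (no learned value baseline).
logp_old = [-1.0, -1.2, -0.8, -0.5]
logp_new = [-0.9, -1.3, -0.8, -0.4]
sparse_loss = clipped_token_loss(logp_new, logp_old, sparse_returns)
dense_loss = clipped_token_loss(logp_new, logp_old, dense_returns)
```

With an undiscounted sequence-level reward, every token receives the same return, so the update cannot tell which tokens helped. With per-token rewards, the returns differ by position, which gives the policy update a token-specific signal.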

Taxonomy

Topics: Distributed systems and fault tolerance · Cryptography and Data Security · Optimization and Search Problems