Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization

Subhojyoti Mukherjee; Viet Dac Lai; Raghavendra Addanki; Ryan Rossi; Seunghyun Yoon; Trung Bui; Anup Rao; Jayakumar Subramanian; and Branislav Kveton

arXiv:2506.06964 · cs.CL · February 17, 2026

TL;DR

This paper introduces a reward-weighted fine-tuning method for offline reinforcement learning with large language models. It learns question-answering policies by directly optimizing rewards, and it outperforms baselines based on supervised fine-tuning and direct preference optimization.

Contribution

The paper recasts offline RL with LLMs as reward-weighted fine-tuning, which can be solved with techniques similar to supervised fine-tuning, and reports improvements in both optimized rewards and language quality on question-answering tasks.

Findings

01 Major gains in optimized rewards

02 Improved language quality

03 Outperforms state-of-the-art methods

Abstract

Offline reinforcement learning (RL) is a variant of RL where the policy is learned from a previously collected dataset of trajectories and rewards. In our work, we propose a practical approach to offline RL with large language models (LLMs). We recast the problem as reward-weighted fine-tuning, which can be solved using techniques similar to those of supervised fine-tuning (SFT). To showcase the value of our approach, we apply it to learning short-horizon question-answering policies of a fixed length, where the agent reasons about potential answers or asks clarifying questions. Our work stands in stark contrast to state-of-the-art methods in this domain, based on SFT and direct preference optimization, which have additional hyper-parameters and do not directly optimize for rewards. We compare to them empirically, and report major gains in both optimized rewards and language quality.
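The idea of reward-weighted fine-tuning can be illustrated with a small sketch. This is not the authors' implementation or their exact objective; it only assumes that each logged trajectory comes with a non-negative scalar reward and that the loss is the SFT negative log-likelihood with each trajectory scaled by its reward. The function names and the toy token probabilities are hypothetical.

```python
import math


def sequence_log_prob(token_probs):
    """Sum of log-probabilities the policy assigns to one logged trajectory."""
    return sum(math.log(p) for p in token_probs)


def reward_weighted_loss(trajectories):
    """Negative reward-weighted log-likelihood, averaged over trajectories.

    Each trajectory is a pair (token_probs, reward), where token_probs are
    the policy's probabilities for the logged tokens and reward is an
    assumed non-negative scalar for the whole trajectory.
    """
    total = 0.0
    for token_probs, reward in trajectories:
        total += reward * sequence_log_prob(token_probs)
    return -total / len(trajectories)


def sft_loss(trajectories):
    """Plain SFT loss: the same objective with every reward set to 1."""
    return reward_weighted_loss([(probs, 1.0) for probs, _ in trajectories])


# Toy logged data: a rewarded conversation and an unrewarded one.
logged = [([0.5, 0.5], 1.0), ([0.25], 0.0)]
weighted = reward_weighted_loss(logged)
unweighted = sft_loss(logged)
```

In this toy case the zero-reward trajectory contributes nothing to the weighted loss, while plain SFT would imitate both trajectories equally. In practice the token probabilities would come from an LLM and the loss would be minimized with a standard fine-tuning loop.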

Videos

Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization (SlidesLive)

Taxonomy

Methods: Shrink and Fine-Tune