Distributionally Robust Token Optimization in RLHF
Yeping Jin, Jiaming Hu, Ioannis Ch. Paschalidis

TL;DR
This paper introduces DRTO, a method combining token-level RLHF with distributionally robust optimization to improve LLM performance under distribution shifts, especially in reasoning tasks.
Contribution
The paper proposes a novel DRTO approach that emphasizes difficult response segments during policy optimization to enhance robustness of LLMs.
Findings
DRTO improves consistency under distribution shifts in reasoning benchmarks.
Achieves +4.4 percentage points on MATH-500 over standard RTO.
Achieves +2.7 percentage points on LiveCodeBench over standard RTO.
Abstract
Large Language Models (LLMs) tend to respond correctly to prompts that align well with the data they were trained and fine-tuned on. Yet, small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this problem, we propose a Distributionally Robust Token Optimization (DRTO) approach, which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO constructs f-divergence ambiguity sets over span-level actor losses, providing a principled way to emphasize difficult response segments during policy optimization. Empirically, DRTO enhances consistency under distribution shifts on multiple reasoning benchmarks across different tasks, achieving +4.4 percentage points on MATH-500 and +2.7 percentage points on LiveCodeBench over standard RTO.
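To make the idea of emphasizing difficult response segments concrete, here is a minimal sketch of how a DRO objective can reweight span-level losses. It is not the authors' implementation. It assumes the KL divergence as the f-divergence, a uniform reference distribution over spans, and the penalized (soft) form of the DRO problem with a hypothetical `temperature` parameter; the function names and the example losses are illustrative only.

```python
import numpy as np


def kl_dro_span_weights(span_losses, temperature=1.0):
    """Worst-case span weights under a KL penalty (assumed f-divergence).

    Solves max_q sum_i q_i * l_i - temperature * KL(q || uniform),
    whose maximizer is q_i proportional to exp(l_i / temperature).
    """
    losses = np.asarray(span_losses, dtype=float)
    # Subtract the max before exponentiating for numerical stability.
    scaled = (losses - losses.max()) / temperature
    weights = np.exp(scaled)
    return weights / weights.sum()


def kl_dro_objective(span_losses, temperature=1.0):
    """Value of the penalized KL-DRO problem.

    Equals temperature * log(mean(exp(l_i / temperature))),
    computed in a numerically stable way.
    """
    losses = np.asarray(span_losses, dtype=float)
    max_loss = losses.max()
    log_mean_exp = np.log(np.mean(np.exp((losses - max_loss) / temperature)))
    return max_loss + temperature * log_mean_exp


if __name__ == "__main__":
    # Hypothetical per-span actor losses for one response.
    example_losses = [0.2, 0.5, 2.0]
    print(kl_dro_span_weights(example_losses, temperature=0.5))
    print(kl_dro_objective(example_losses, temperature=0.5))
```

Under these assumptions, spans with higher loss receive larger weights, and the robust objective lies between the average span loss and the maximum span loss. A small `temperature` concentrates weight on the hardest span, while a large one approaches plain averaging.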
