Distributionally Robust Token Optimization in RLHF

Yeping Jin; Jiaming Hu; Ioannis Ch. Paschalidis

arXiv:2604.08577 · cs.LG · May 12, 2026

TL;DR

This paper introduces DRTO, a method combining token-level RLHF with distributionally robust optimization to improve LLM performance under distribution shifts, especially in reasoning tasks.

Contribution

The paper proposes DRTO, which places f-divergence ambiguity sets over span-level actor losses so that policy optimization emphasizes difficult response segments, making LLMs more robust to distribution shifts.

Findings

01

DRTO improves consistency under distribution shifts in reasoning benchmarks.

02

DRTO gains 4.4 percentage points on MATH-500 over standard RTO.

03

DRTO gains 2.7 percentage points on LiveCodeBench over standard RTO.

Abstract

Large Language Models (LLMs) tend to respond correctly to prompts that align well with the data they were trained and fine-tuned on. Yet small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this problem, we propose Distributionally Robust Token Optimization (DRTO), which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO constructs f-divergence ambiguity sets over span-level actor losses, providing a principled way to emphasize difficult response segments during policy optimization. Empirically, DRTO improves consistency under distribution shifts on multiple reasoning benchmarks across different tasks, achieving $+4.4$ percentage points on MATH-500 and $+2.7$ percentage points on LiveCodeBench over standard RTO.
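To give a rough sense of how an f-divergence ambiguity set can emphasize difficult spans, here is a minimal sketch. It is not the authors' implementation. It assumes the KL divergence as the f-divergence, a uniform nominal distribution over spans, and the penalized (rather than constrained) form of the DRO objective, for which the worst-case value has the well-known closed form temperature * log(mean(exp(loss / temperature))). The function names, the `temperature` parameter, and the loss values are all hypothetical.

```python
import math


def kl_dro_weights(span_losses, temperature):
    """Worst-case span weights under a KL penalty (assumed uniform nominal weights).

    Each weight is proportional to exp(loss / temperature), so spans with
    larger loss receive more weight. Hypothetical helper, for illustration only.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(span_losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - peak) / temperature) for loss in span_losses]
    total = sum(exps)
    return [e / total for e in exps]


def kl_dro_loss(span_losses, temperature):
    """KL-penalized robust loss: temperature * log(mean(exp(loss / temperature)))."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(span_losses)
    mean_exp = sum(
        math.exp((loss - peak) / temperature) for loss in span_losses
    ) / len(span_losses)
    return peak + temperature * math.log(mean_exp)


# Made-up span-level actor losses for one response: two easy spans, one hard span.
span_losses = [0.2, 0.5, 2.0]
weights = kl_dro_weights(span_losses, temperature=1.0)
robust = kl_dro_loss(span_losses, temperature=1.0)
average = sum(span_losses) / len(span_losses)
```

In this sketch the robust loss always lies between the plain average and the largest span loss. A small `temperature` pushes it toward the hardest span, while a large one recovers the ordinary average. The ambiguity set and divergence used in the paper may differ.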
