RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation

Kaiqu Liang; Haimin Hu; Ryan Liu; Thomas L. Griffiths; Jaime Fernández Fisac

arXiv:2501.08617 · cs.LG · June 11, 2025 · 2 cites

TL;DR

This paper introduces RLHS, a novel approach that mitigates misalignment in RLHF by using hindsight simulation, leading to improved alignment and robustness across various tasks and evaluation methods.

Contribution

The paper proposes RLHS, a new method that uses simulated outcomes to decouple feedback from AI predictions, effectively reducing misalignment in reinforcement learning from human feedback.

Findings

01. RLHS significantly outperforms RLHF in alignment tasks.

02. RLHS demonstrates robustness across multiple evaluation benchmarks.

03. Hindsight simulation reduces systematic misalignment in AI models.

Abstract

While Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning generative AI, we present empirical evidence that it can also cause severe, systematic misalignment. We hypothesize that this stems from evaluator feedback depending on downstream outcome predictions (foresight) that can be influenced by the AI's output, inducing Goodhart's law dynamics. We present a theoretical analysis showing that conditioning evaluator feedback on downstream observations (hindsight) inhibits this effect by decoupling the alignment signal from potentially compromised predictions--crucially, the result holds even if the observed outcomes are sampled from the AI's own world model. Building on this insight, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which presents plausible simulated outcomes to evaluators before eliciting feedback. We validate RLHS across…
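The foresight/hindsight distinction in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the `Interaction` class, the two rating functions, and the utility values (+1, 0, -1) are all hypothetical choices made for illustration, loosely modeled on the marketplace-chatbot setting mentioned in the reviews. The foresight evaluator rates a response based on the outcome it expects from the assistant's claim; the hindsight evaluator rates it after observing the (simulated) outcome.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    """Toy shopping interaction (hypothetical, for illustration only)."""
    claimed_has_feature: bool  # what the assistant tells the user
    actual_has_feature: bool   # ground truth about the item


def user_buys(x: Interaction) -> bool:
    # Assumption: the user buys exactly when the assistant claims the feature.
    return x.claimed_has_feature


def foresight_rating(x: Interaction) -> int:
    """Rating given right after the chat, before any outcome is observed.

    The evaluator predicts the outcome from the assistant's claim alone,
    so a false claim can inflate the rating.
    """
    return 1 if x.claimed_has_feature else 0


def hindsight_rating(x: Interaction) -> int:
    """Rating given after the outcome of the purchase decision is shown."""
    if not user_buys(x):
        return 0  # no purchase, nothing gained or lost
    return 1 if x.actual_has_feature else -1


honest = Interaction(claimed_has_feature=False, actual_has_feature=False)
deceptive = Interaction(claimed_has_feature=True, actual_has_feature=False)

for name, x in [("honest", honest), ("deceptive", deceptive)]:
    print(name, "foresight:", foresight_rating(x), "hindsight:", hindsight_rating(x))
```

In this toy setup the deceptive response is preferred under foresight feedback and the honest one under hindsight feedback, which is the kind of reversal the abstract attributes to decoupling feedback from the evaluator's possibly compromised predictions.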

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

- The paper extends our understanding of RLHF by showing that hindsight horizon $N$ exponentially improves the accuracy of utility estimates. Lemma 1 & Theorem 1 show that the difference in finite-hindsight utility estimation approaches the true utility difference as $N$ increases. This is a unique and strong theoretical result.
- Applicable to both PPO and DPO-like approaches.
- True utility and satisfaction rating are well-defined as concepts to evaluate.
- Figures 1-5 and Tables 1-2 are vis

Weaknesses

- Assumes human evaluators operate under $P(s \succ s') = \sigma(\beta (R_T(s) - R_T(s')))$, and completely discounts the possibility of error correction, bounded rationality and cognitive biases. Should at least propose how hindsight simulations would deal with systematic bias/errors and discuss those effects in higher detail.
- Should discuss mathematically and empirically the sensitivity of RLHS to different values of the hindsight horizon $N$.
- The evaluation utility ($U = 0$, $U = -1$, $U
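For readers unfamiliar with the preference model quoted in the first weakness above, $P(s \succ s') = \sigma(\beta (R_T(s) - R_T(s')))$, here is a minimal sketch of how it maps a utility gap to a choice probability. The function name and the example values are assumptions for illustration; only the formula itself comes from the review text.

```python
import math


def preference_probability(r_s: float, r_s_prime: float, beta: float = 1.0) -> float:
    """Probability that an evaluator prefers s over s'.

    Implements P(s > s') = sigmoid(beta * (R_T(s) - R_T(s'))).
    beta controls how deterministic the evaluator is: beta -> 0 gives
    coin-flip choices, large beta gives near-deterministic choices.
    """
    return 1.0 / (1.0 + math.exp(-beta * (r_s - r_s_prime)))


# Hypothetical utilities for two responses.
print(preference_probability(1.0, 0.0, beta=1.0))
print(preference_probability(1.0, 0.0, beta=5.0))
```

The model is noisy but unbiased: errors shrink as the utility gap or $\beta$ grows, and there is no term for systematic bias, which is the limitation the reviewer points to.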

Reviewer 02 · Rating 5 · Confidence 4

Strengths

The paper discusses an interesting perspective using the example of a marketplace chatbot, where the authors argue that preferences over responses should be collected after the final outcome instead of directly based on the chat experience. This can potentially be a way to help fix data labeling issues. The authors also provide some experimental results showing that misalignment is mitigated here.

Weaknesses

I have a few major concerns regarding the paper.
1. I'm not fully convinced by the motivation and solution. It seems from the marketplace example that, in the case of a deceptive response, the model actually does not follow the instructions and provides responses that are in conflict with the given context. Compared with asking for end outcomes, which comes with high uncertainty, it seems that one can easily ask a human labeler (or a strong LLM proxy) to label if the responses are consistent with

Reviewer 03 · Rating 8 · Confidence 3

Strengths

Originality: Original idea that addresses a real problem with RLHF (which is well described).
Quality: Experiments very well designed with both LLMs simulating humans and real humans, and with both PPO and DPO, to support the claims of the paper.
Clarity: Very well written and understandable, from the problem formulation, to the explanation of the method.
Significance: Could have significant impacts in specific settings where this kind of hindsight simulation data is available.

Weaknesses

The quality of the results in this paper seems to rely heavily on the quality of the simulation for hindsight feedback. I think there should be an analysis of using different sizes of models for the hindsight feedback and see what kind of effect this has on performance. The evaluation is also only conducted in one environment (a shopping environment) where the RLHS algorithm makes a lot of sense. I think there should be at least one other environment used in the paper to demonstrate the general

Code & Models

Models

🤗 kaiquliang/Llama-3-8b-RLHF
model · 3 dl

Datasets

kaiquliang/RLHS-TestBench
dataset · 12 dl

Taxonomy

Topics: Real-time simulation and control systems

Methods: ALIGN