Causally Robust Reward Learning from Reason-Augmented Preference Feedback

Minjune Hwang; Yigit Korkmaz; Daniel Seita; Erdem Bıyık

arXiv:2603.04861·cs.AI·March 6, 2026

Open Access · 3 Reviews

TL;DR

ReCouPLe is a framework that uses natural language rationales to improve reward learning by focusing on causal features, leading to better generalization and transfer across tasks.

Contribution

It introduces a novel method that leverages reason-augmented feedback to enhance causal robustness in reward models without extra data or fine-tuning.

Findings

01. Outperforms baselines by up to 1.5x in reward accuracy under distribution shifts.

02. Achieves 2x improvement in downstream policy performance on new tasks.

03. Effectively reuses causal signals across multiple tasks with shared semantics.

Abstract

Preference-based reward learning is widely used for shaping agent behavior to match a user's preference, yet its sparse binary feedback makes it especially vulnerable to causal confusion. The learned reward often latches onto spurious features that merely co-occur with preferred trajectories during training, collapsing when those correlations disappear or reverse at test time. We introduce ReCouPLe, a lightweight framework that uses natural language rationales to provide the missing causal signal. Each rationale is treated as a guiding projection axis in an embedding space, training the model to score trajectories based on features aligned with that axis while de-emphasizing context that is unrelated to the stated reason. Because the same rationales (e.g., "avoids collisions", "completes the task faster") can appear across multiple tasks, ReCouPLe naturally reuses the same causal…
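The projection idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the embedding model, the reward head, or the loss, so the linear head `w`, the helper names, and the Bradley-Terry preference loss (a common choice in preference-based reward learning) are all assumptions made for illustration.

```python
import numpy as np

def project_onto_rationale(z, r):
    """Split embedding z into its component along rationale axis r
    and the orthogonal remainder (context unrelated to the reason)."""
    r_hat = r / np.linalg.norm(r)
    aligned = np.dot(z, r_hat) * r_hat
    return aligned, z - aligned

def reward(z, r, w):
    """Score a trajectory embedding using only its rationale-aligned part.
    The linear head w is a hypothetical stand-in for a learned reward head."""
    aligned, _ = project_onto_rationale(z, r)
    return float(np.dot(w, aligned))

def bradley_terry_loss(score_preferred, score_rejected):
    """Negative log-likelihood that the preferred trajectory wins,
    i.e. -log(sigmoid(score_preferred - score_rejected))."""
    return float(np.logaddexp(0.0, -(score_preferred - score_rejected)))

# Toy example: axis 0 plays the role of the stated reason,
# axis 1 plays the role of a spurious, co-occurring feature.
rationale = np.array([1.0, 0.0])
head = np.array([1.0, 1.0])
preferred = np.array([2.0, 5.0])
rejected = np.array([1.0, -3.0])

loss = bradley_terry_loss(reward(preferred, rationale, head),
                          reward(rejected, rationale, head))
print("preference loss:", loss)

# Changing only the spurious coordinate leaves the reward unchanged.
shifted = np.array([2.0, -9.0])
print(reward(preferred, rationale, head) == reward(shifted, rationale, head))
```

In this toy setup the reward ignores anything orthogonal to the rationale axis, which is the intuition behind de-emphasizing context unrelated to the stated reason. A hard projection is the simplest version of that idea; the paper's actual training objective may treat the orthogonal component differently.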

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Employs language-based causal alignment by combining simple preference data with natural-language rationales, guiding the model to focus on causally relevant features that reflect the user’s true intent.
- Demonstrates zero-shot reward transfer to unseen environments without requiring any additional preference collection or reward model training.

Weaknesses

- The task instructions used are overly simple, and it is unclear whether incorporating language reasoning truly provides an advantage in this setup. It would strengthen the work to include experiments using more diverse language rationales or to analyze whether varying the linguistic expressions for the same rationale improves performance.
- The rationale extraction process appears heuristic and heavily dependent on ground-truth rewards (especially in MetaWorld). This reliance limits the method

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The paper applies a method that has been used in goal-conditioned RL to the preference learning setting to boost the limited amount of information available in a preference label.
- Addressing the limited amount of information available in a preference signal is a key problem to solve in order to improve the field's general ability to learn effective and robust reward models.
- The paper is easy to read and follow.

Weaknesses

- Some experiments are missing to truly understand where ReCouPLe performs well:
  - impact of the number of preference samples on reward and learned-policy quality
  - impact of noisy preference labels
  - impact of noisy rationales
  - combining data with and without rationales, since in practice collecting large datasets with rationales will be expensive and impractical
- The ManiSkill experiments lack diversity in rationales, so it is not clear how well the results will generalize.
- W

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The paper is clearly written and easy to follow. The environments and datasets are carefully designed to capture confounding problems in PbRL and the experiments clearly demonstrate the effectiveness of the proposed approach. Although it is not immediately clear how frequently causal confusions occur in real-world settings, I believe the method has potential applications beyond the specific cases studied here. In particular, it could be useful for more general problems where explanations are ava

Weaknesses

My main concern is that language models are well known to be sensitive to phrasing, and this is not really explored in the paper (e.g., using different LMs, paraphrasing). It is also not clear what kind of LMs should be used to best extract the task/reason embeddings. For Fig 3, I think it would be nice to also report in-distribution performance; otherwise, it is difficult to tell whether performance stagnates because of problems in RL training or in reward modeling. Aside from the

Taxonomy

Topics: Reinforcement Learning in Robotics · Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI)