Explainable reinforcement learning from human feedback to improve alignment

Shicheng Liu; Siyuan Xu; Wenjie Qiu; Hangfan Zhang; Minghui Zhu

arXiv:2512.13837·cs.LG·December 17, 2025

Open Access

TL;DR

This paper introduces a method to enhance reinforcement learning from human feedback for language models by explaining unsatisfactory responses and unlearning problematic training data, leading to improved alignment.

Contribution

The paper presents a post-hoc explanation and unlearning approach that corrects the causes of unsatisfactory responses in RLHF-trained language models.

Findings

01

The method effectively explains why certain responses are generated.

02

Unlearning identified training data improves response quality.

03

The approach maintains performance on satisfactory responses.

Abstract

A common and effective strategy for humans to improve an unsatisfactory outcome in daily life is to find a cause of this outcome and correct the cause. In this paper, we investigate whether this human improvement strategy can be applied to improving reinforcement learning from human feedback (RLHF) for alignment of language models (LMs). In particular, it is observed in the literature that LMs tuned by RLHF can still output unsatisfactory responses. This paper proposes a method to improve the unsatisfactory responses by correcting their causes. Our method has two parts. The first part proposes a post-hoc explanation method to explain why an unsatisfactory response is generated to a prompt by identifying the training data that lead to this response. We formulate this problem as a constrained combinatorial optimization problem where the objective is to find a set of training data closest…
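The abstract is cut off before it says what the selected training set should be closest to, so the paper's actual objective and constraints are not known from this page. As a generic illustration of constrained subset selection only, the sketch below assumes hypothetical 2-D embedding vectors for training examples, a hypothetical target vector, and a cardinality limit `max_size`; it searches exhaustively for the subset whose mean lies nearest the target. None of these names or choices come from the paper.

```python
from itertools import combinations


def closest_subset(vectors, target, max_size):
    """Exhaustively search non-empty subsets of at most max_size vectors.

    Returns (indices, squared_distance) for the subset whose mean vector is
    nearest to target. This is a toy stand-in, not the paper's formulation.
    """
    best_indices, best_dist = None, float("inf")
    dim = len(target)
    for size in range(1, max_size + 1):
        for indices in combinations(range(len(vectors)), size):
            # Mean of the selected vectors, one coordinate at a time.
            mean = [sum(vectors[i][d] for i in indices) / size for d in range(dim)]
            # Squared Euclidean distance from the subset mean to the target.
            dist = sum((m - t) ** 2 for m, t in zip(mean, target))
            if dist < best_dist:
                best_indices, best_dist = indices, dist
    return best_indices, best_dist


# Hypothetical embeddings of four training examples and one target point.
train_vectors = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
target_vector = (1.0, 1.0)

selected, distance = closest_subset(train_vectors, target_vector, max_size=2)
print(selected, distance)
```

Exhaustive search is only feasible for tiny inputs because the number of candidate subsets grows combinatorially with the pool size, which is why problems of this kind are usually approached with relaxations or heuristics at realistic scale.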

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications