Targeting Misalignment: A Conflict-Aware Framework for Reward-Model-based LLM Alignment

Zixuan Liu; Siavash H. Khajavi; Guangkai Jiang; Xinru Liu

arXiv:2512.09212 · cs.CL · January 21, 2026

TL;DR

This paper introduces a conflict-aware framework for improving reward-model-based LLM alignment by detecting and addressing proxy-policy conflicts, leading to more robust alignment despite biased reward signals.

Contribution

It proposes novel conflict detection metrics and a targeted feedback algorithm to refine models, addressing misalignment caused by proxy reward inaccuracies.

Findings

01 Enhanced alignment performance on two tasks

02 Effective identification of proxy-policy conflicts

03 Robustness to biased reward signals

Abstract

Reward-model-based fine-tuning is a central paradigm in aligning Large Language Models with human preferences. However, such approaches critically rely on the assumption that proxy reward models accurately reflect intended supervision, a condition often violated due to annotation noise, bias, or limited coverage. This misalignment can lead to undesirable behaviors, where models optimize for flawed signals rather than true human values. In this paper, we investigate a novel framework to identify and mitigate such misalignment by treating the fine-tuning process as a form of knowledge integration. We focus on detecting instances of proxy-policy conflicts, cases where the base model strongly disagrees with the proxy. We argue that such conflicts often signify areas of shared ignorance, where neither the policy nor the reward model possesses sufficient knowledge, making them especially…
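To make the notion of a proxy-policy conflict concrete, here is a minimal Python sketch. It is not the paper's metric, which the truncated abstract does not spell out. It assumes, purely for illustration, that each prompt comes with two candidate responses, that the policy's preference is read from sequence log-probabilities, and that the proxy's preference is read from reward scores. The names `PairScores`, `conflict_score`, `flag_conflicts`, and the `threshold` value are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PairScores:
    """Scores for two candidate responses (a, b) to one prompt (hypothetical layout)."""
    policy_logp_a: float  # policy log-probability of response a
    policy_logp_b: float  # policy log-probability of response b
    reward_a: float       # proxy reward model score for response a
    reward_b: float       # proxy reward model score for response b


def conflict_score(p: PairScores) -> float:
    """Illustrative disagreement score, not the paper's metric.

    Returns 0 when policy and proxy rank the pair the same way, and a
    positive value that grows with both margins when they rank it oppositely.
    """
    policy_margin = p.policy_logp_a - p.policy_logp_b
    reward_margin = p.reward_a - p.reward_b
    # Opposite signs make the product negative, so negate and clip at zero.
    return max(0.0, -policy_margin * reward_margin)


def flag_conflicts(pairs: Sequence[PairScores], threshold: float = 1.0) -> List[int]:
    """Return indices of pairs whose disagreement exceeds an assumed threshold."""
    return [i for i, p in enumerate(pairs) if conflict_score(p) > threshold]
```

Under these assumptions, only the strong disagreements are flagged: a pair where the policy clearly favors one response while the proxy clearly favors the other. Flagged pairs would then be candidates for the targeted feedback the paper describes. The product of margins is one simple choice among many; a rank-correlation or calibrated-probability measure would serve the same illustrative purpose.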

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification