Alleviating Attention Hacking in Discriminative Reward Modeling through Interaction Distillation

Jianxiang Zang

arXiv:2508.02618·cs.CL·January 14, 2026

Open Access

TL;DR

This paper introduces Interaction Distillation, a novel training framework that enhances reward models in RLHF by addressing attention hacking issues through attention-level optimization, leading to more stable and generalizable reward signals.

Contribution

It proposes a new interaction-based distillation method that aligns the reward model's attention patterns with those of a teacher model to improve discriminative reward modeling.
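
One way to picture an attention-level objective of this kind is a preference loss combined with a term that pulls the student's attention distributions toward a teacher's. The sketch below is an illustration under stated assumptions, not the paper's implementation: the Bradley-Terry preference loss, the KL-based alignment term, the weight `lam`, and all function names are hypothetical choices made for this example.

```python
import math


def softmax(row):
    """Turn a row of raw attention scores into a probability distribution."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_alignment_loss(student_scores, teacher_scores, eps=1e-12):
    """Mean KL(teacher || student) over rows of attention scores.

    Assumption: both models expose score matrices of the same shape.
    """
    total = 0.0
    for s_row, t_row in zip(student_scores, teacher_scores):
        p = softmax(t_row)  # teacher attention distribution
        q = softmax(s_row)  # student attention distribution
        total += sum(
            pi * (math.log(pi + eps) - math.log(qi + eps))
            for pi, qi in zip(p, q)
        )
    return total / len(student_scores)


def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    x = r_chosen - r_rejected
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def total_loss(r_chosen, r_rejected, student_scores, teacher_scores, lam=0.5):
    """Preference loss plus a weighted attention-alignment term (lam is hypothetical)."""
    align = attention_alignment_loss(student_scores, teacher_scores)
    return preference_loss(r_chosen, r_rejected) + lam * align
```

When the student's attention scores match the teacher's, the alignment term is zero and only the preference loss remains; how the paper actually weights or defines the two terms is not stated on this page.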

Findings

01. Interaction Distillation yields more stable reward signals.

02. It outperforms state-of-the-art RM optimization methods.

03. It addresses fundamental limitations in current discriminative reward models.

Abstract

The reward model (RM), the core component of reinforcement learning from human feedback (RLHF) for large language models (LLMs), is responsible for providing reward signals for generated responses. However, mainstream discriminative reward modeling handles token-level interaction inadequately, which leaves its judgment signals vulnerable to being hacked by attention misallocated across the context. This stems from two fundamental limitations: (1) Current preference modeling employs decoder-only architectures, where the unidirectional causal attention mechanism leads to forward-decaying intra-sequence attention within the prompt-response sequence. (2) The independent Siamese-encoding paradigm leaves no token-level inter-sequence attention between the chosen and rejected sequences. To address this "attention hacking", we propose "Interaction Distillation", a novel training framework…
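
The two structural limitations named in the abstract can be made concrete with attention visibility masks. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and it shows only which token pairs are allowed to interact, not the attention weights themselves.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def siamese_mask(len_chosen, len_rejected):
    """Visibility over the concatenated [chosen; rejected] tokens when each
    sequence is encoded independently with causal attention.

    The result is block-diagonal: no entry links a chosen token to a
    rejected token, or the reverse.
    """
    n = len_chosen + len_rejected
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            same_sequence = (i < len_chosen) == (j < len_chosen)
            mask[i][j] = same_sequence
    return mask


def cross_sequence_links(mask, len_chosen):
    """Count visible pairs whose two tokens come from different sequences."""
    return sum(
        1
        for i, row in enumerate(mask)
        for j, visible in enumerate(row)
        if visible and ((i < len_chosen) != (j < len_chosen))
    )


if __name__ == "__main__":
    # Under causal attention the first token sees only itself,
    # while the last token sees the whole sequence.
    print([sum(row) for row in causal_mask(4)])
    # Independent encoding leaves no chosen/rejected token interaction.
    print(cross_sequence_links(siamese_mask(3, 2), 3))
```

In the causal mask, earlier tokens never see later ones, so prompt tokens cannot attend to the response that follows them. In the Siamese mask, the count of cross-sequence links is zero, which is the missing inter-sequence attention the abstract describes.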

Taxonomy

Topics: Emotion and Mood Recognition · Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI)