Activation Reward Models for Few-Shot Model Alignment

Tianning Chai; Chancharik Mitra; Brandon Huang; Gautam Rajendrakumar Gare; Zhiqiu Lin; Assaf Arbelle; Leonid Karlinsky; Rogerio Feris; Trevor Darrell; Deva Ramanan; Roei Herzig

arXiv:2507.01368 · cs.CV · July 3, 2025

TL;DR

This paper introduces Activation Reward Models, a few-shot reward modeling technique that uses activation steering to improve alignment of language and multimodal models with human preferences without additional fine-tuning.

Contribution

The paper presents Activation Reward Models, a novel few-shot reward modeling method that outperforms existing approaches and enhances safety by mitigating reward hacking.

Findings

01 · Activation RMs outperform existing few-shot reward models.

02 · Activation RMs effectively mitigate reward hacking behaviors.

03 · Activation RMs achieve state-of-the-art results on the PreferenceHack benchmark.

Abstract

Aligning Large Language Models (LLMs) and Large Multimodal Models (LMMs) to human preferences is a central challenge in improving the quality of the models' generative outputs for real-world applications. A common approach is to use reward modeling to encode preferences, enabling alignment via post-training using reinforcement learning. However, traditional reward modeling is not easily adaptable to new preferences because it requires a separate reward model, commonly trained on large preference datasets. To address this, we introduce Activation Reward Models (Activation RMs) -- a novel few-shot reward modeling method that leverages activation steering to construct well-aligned reward signals using minimal supervision and no additional model finetuning. Activation RMs outperform existing few-shot reward modeling approaches such as LLM-as-a-judge with in-context learning, voting-based…
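To make the general idea of activation steering concrete, here is a toy sketch. It is not the paper's pipeline: the two-dimensional "activations", the readout weights `w_yes` and `w_no`, and the helper names `mean_vector`, `steer`, and `yes_probability` are all hypothetical and chosen only for illustration. The sketch assumes a steering vector is formed as the mean of activations from a few examples, added to a query's activation, and that the reward is read off as the probability of a "Yes" token.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steer(activation, steering, alpha=1.0):
    """Add a scaled steering vector to an activation (hypothetical injection step)."""
    return [a + alpha * s for a, s in zip(activation, steering)]

def yes_probability(activation, w_yes, w_no):
    """Two-way softmax over toy 'Yes'/'No' logits, used here as a reward score."""
    logit_yes = sum(a * w for a, w in zip(activation, w_yes))
    logit_no = sum(a * w for a, w in zip(activation, w_no))
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

# Hypothetical activations collected from three few-shot preference examples.
few_shot_activations = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
steering = mean_vector(few_shot_activations)

# Hypothetical query activation and readout directions for the two tokens.
query = [0.1, 0.5]
w_yes = [1.0, 0.0]
w_no = [0.0, 1.0]

base = yes_probability(query, w_yes, w_no)
steered = yes_probability(steer(query, steering), w_yes, w_no)
print(base, steered)
```

In this toy setup the steered score is higher than the unsteered one because the few-shot mean points along the "Yes" readout direction. In a real model the activations would come from selected attention heads rather than hand-written vectors.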

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 3

Strengths

This is an interesting paper that provides a new (to my knowledge) mechanism for defining reward models. Setting aside my concern about statistical significance, the method takes advantage of specialization among attention heads and appears to generate improved results.

Weaknesses

# Serious issues / unresolved questions
- Which subset of attention heads is searched over during the training phase? All layers?
- My biggest concern is that this type of approach could increase the variance of the outcomes, but there are no standard errors indicating the variation across the dataset. This makes it hard to tell if the improvements are real.

# Minor issues
- The biggest writing issue I have reading this paper is that it has just moved many of the relevant technical details to th

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- It is very interesting to consider activation steering for reward modelling.
- Experiments show the effectiveness of the proposed method.
- The example presentation helps me understand the proposed benchmark.

Weaknesses

- The paper is not well written and is hard to follow.
- The proposed method seems overly complex and unintuitive: it combines REINFORCE-based head selection, weighted PCA, and token probability scoring in a multi-step pipeline.
- It is unclear which step plays the more important role in the method.
- The baselines do not include implicit reward modelling methods such as DPO, IPO or SimPO.
- It seems that the performance on the Reward

Reviewer 03 · Rating 2 · Confidence 3

Strengths

The paper addresses an interesting problem, as few-shot reward learning is a viable approach for personalisation in LLMs. Moreover, I think the constructed PreferenceHack benchmark can help investigate reward hacking of positivity, length, and format of the preferences.

Weaknesses

While the paper addresses an interesting problem, and I appreciate the authors' proposed method, I believe that this paper has some significant points that need to be improved upon, or at least addressed, by the authors:
- **Missing related work**: There are some key missing related works in the realm of few-shot preference learning. While I understand that most of these require retraining the reward model, it would be important to highlight these and compare the results [1, 2, 3]. Moreover, Al

Reviewer 04 · Rating 2 · Confidence 3

Strengths

- Interesting direction: The notion of activation-level alignment—intervening on internal representations instead of fine-tuning—is conceptually novel and could inspire future research.

Weaknesses

1. **Poorly explained methodology:** The core mechanism of ARM is underspecified. It is unclear what the criterion ($j$) refers to, what $\lambda_j^{\text{ARM}}$ represents (a set of locations $(l, m)$?), or how activations are "injected" at selected head locations.
2. **Unclear use of REINFORCE:** The paper claims to use REINFORCE for head selection but does not define what constitutes the "reward" signal for this optimization or how the stochastic policy over heads is parameterized.
3. **

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Recommender Systems and Techniques