Shallow Preference Signals: Large Language Model Aligns Even Better with Truncated Data?

Xuan Qi; Jiahao Qiu; Xinzhe Juan; Yue Wu; Mengdi Wang

arXiv:2505.17122·cs.CL·May 26, 2025

TL;DR

This paper finds that the preference signal used to align large language models is concentrated in the early tokens of a response. Training on data truncated to those tokens achieves comparable or better alignment performance than training on full responses, which suggests that current alignment strategies may need rethinking.

Contribution

The study uncovers the prevalence of shallow preference signals in LLMs and demonstrates that training on truncated datasets can enhance alignment efficiency and effectiveness.

Findings

01 Models trained on truncated data perform as well as or better than models trained on full data.

02 Shallow preference signals are concentrated in the early tokens of responses.

03 Decoding strategies that leverage shallow signals improve alignment and efficiency.

Abstract

Aligning large language models (LLMs) with human preferences remains a key challenge in AI. Preference-based optimization methods, such as Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on human-annotated datasets to improve alignment. In this work, we identify a crucial property of the existing learning method: the distinguishing signal obtained in preferred responses is often concentrated in the early tokens. We refer to this as shallow preference signals. To explore this property, we systematically truncate preference datasets at various points and train both reward models and DPO models on the truncated data. Surprisingly, models trained on truncated datasets, retaining only the first half or fewer tokens, achieve comparable or even superior performance to those trained on full datasets. For example, a reward model trained on the…
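The truncation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the `prompt`/`chosen`/`rejected` field names, the `keep_ratio` parameter, and the use of whitespace splitting in place of a real model tokenizer are all assumptions made for illustration.

```python
from typing import Dict, List


def truncate_response(text: str, keep_ratio: float) -> str:
    """Keep the first `keep_ratio` fraction of whitespace-delimited tokens.

    Assumption: whitespace splitting stands in for a real tokenizer.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    tokens = text.split()
    if not tokens:
        return text
    # Always keep at least one token so the response is never empty.
    keep = max(1, int(len(tokens) * keep_ratio))
    return " ".join(tokens[:keep])


def truncate_pairs(
    pairs: List[Dict[str, str]], keep_ratio: float
) -> List[Dict[str, str]]:
    """Truncate both responses of every preference pair; prompts are untouched.

    Assumption: each pair is a dict with hypothetical keys
    "prompt", "chosen", and "rejected".
    """
    return [
        {
            "prompt": pair["prompt"],
            "chosen": truncate_response(pair["chosen"], keep_ratio),
            "rejected": truncate_response(pair["rejected"], keep_ratio),
        }
        for pair in pairs
    ]


if __name__ == "__main__":
    # Toy data, invented purely for this example.
    data = [
        {
            "prompt": "Explain overfitting.",
            "chosen": "Overfitting happens when a model memorizes noise in training data.",
            "rejected": "It is when the model is bad.",
        }
    ]
    for ratio in (1.0, 0.5, 0.25):
        print(ratio, truncate_pairs(data, ratio)[0]["chosen"])
```

The truncated pairs would then be passed to a reward-model or DPO training loop in place of the full dataset. Sweeping `keep_ratio` over several values mirrors the idea of truncating "at various points", though the specific ratios above are arbitrary.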
