Value Drifts: Tracing Value Alignment During LLM Post-Training

Mehar Bhatia; Shravan Nayak; Gaurav Kamath; Marius Mosbach; Karolina Stańczak; Vered Shwartz; Siva Reddy

arXiv:2510.26707·cs.CL·October 31, 2025

TL;DR

This paper investigates how and when large language models come to express human values during post-training. It finds that supervised fine-tuning (SFT) largely establishes a model's values and that subsequent preference optimization rarely re-aligns them.

Contribution

It provides a detailed analysis of value drift during LLM post-training, highlighting the effects of algorithms and datasets on value alignment over time.

Findings

01 · SFT phase primarily establishes model values

02 · Preference optimization rarely re-aligns existing values

03 · Different algorithms lead to different value outcomes

Abstract

As LLMs occupy an increasingly important role in society, they are more and more confronted with questions that require them not only to draw on their general knowledge but also to align with certain human value systems. Therefore, studying the alignment of LLMs with human values has become a crucial field of inquiry. Prior work, however, mostly focuses on evaluating the alignment of fully trained models, overlooking the training dynamics by which models learn to express human values. In this work, we investigate how and at which stage value alignment arises during the course of a model's post-training. Our analysis disentangles the effects of post-training algorithms and datasets, measuring both the magnitude and time of value drifts during training. Experimenting with Llama-3 and Qwen-3 models of different sizes and popular supervised fine-tuning (SFT) and preference optimization…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The paper tackles an important problem (value alignment) from a novel, underexplored perspective (training dynamics).
2. The paper is well written, with clear figures, tables, and experimental setups.
3. The paper shows that value alignment evolves differently during training under different preference learning algorithms, which could be an interesting avenue for further study.

Weaknesses

The main missing piece of this analysis, in my mind, is an understanding of the datasets themselves:
1. To be able to say anything meaningful about the claim that models learn their values during SFT, it seems important to disentangle the stance distribution of the SFT datasets in the analysis. For instance, how closely do the models match the distributions of the datasets used?
2. The SFT vs. preference learning comparison does not account for the fact that the datasets use different query distributions. Th

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- The study evaluates several post-training methods, including preference optimization, to measure value drift.
- The paper explores whether the low value gap in standard preference datasets explains the low value drift, using two distinct scenarios, support-aligned and oppose-aligned, in which the preferred labels are switched.

Weaknesses

- No explanation is provided for why models adhere to the values that were aligned during the SFT phase of preference optimization. All results are discussed only empirically, and the results themselves are limited. Additionally, there is limited discussion of how the different datasets used during SFT lead to varying magnitudes of drift during preference optimization.
- The study fails to offer practical recommendations for mitigating value drift during preference optimization for algorithms such as DPO. T

Reviewer 03 · Rating 4 · Confidence 4

Strengths

The paper is well written and easy to follow. The authors study a very interesting and important topic: the value preferences expressed in models' outputs and how models acquire those preferences through post-training.

Weaknesses

I list a few weaknesses below:
1. Overall, the paper reads more like an empirical study of model value alignment. It does not propose a novel algorithm or framework for better aligning models with human values, nor does it construct a new dataset or benchmark for evaluating models' values. The contribution of the paper is therefore somewhat limited. It seems to be preliminary work that needs deeper study of the topic, for example, how to efficiently (with less compute or fewer data points) align model wit

Code & Models

Datasets

McGill-NLP/value-drifts
dataset· 33 dl
