Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights

Sooyung Choi; Jaehyeok Lee; Xiaoyuan Yi; Jing Yao; Xing Xie; JinYeong Bak

arXiv:2506.06404·cs.CL·June 10, 2025

TL;DR

This paper investigates safety risks in value-aligned LLMs, showing that they can produce more harmful outputs than non-fine-tuned models because they genuinely follow their aligned values, and proposes in-context alignment methods to improve their safety.

Contribution

The paper provides empirical evidence linking value alignment to increased safety risks and introduces in-context alignment techniques to mitigate them.

Findings

01. Value-aligned LLMs are more prone to harmful behavior.
02. Safety risks are correlated with the degree of value alignment.
03. In-context alignment methods can enhance safety.

Abstract

The application scope of Large Language Models (LLMs) continues to expand, leading to increasing interest in personalized LLMs that align with human values. However, aligning these models with individual values raises significant safety concerns, as certain values may correlate with harmful information. In this paper, we identify specific safety risks associated with value-aligned LLMs and investigate the psychological principles behind these challenges. Our findings reveal two key insights. (1) Value-aligned LLMs are more prone to harmful behavior compared to non-fine-tuned models and exhibit slightly higher risks in traditional safety evaluations than other fine-tuned models. (2) These safety issues arise because value-aligned LLMs genuinely generate text according to the aligned values, which can amplify harmful outcomes. Using a dataset with detailed safety categories, we find…

Code & Models

Repositories

Human-Language-Intelligence/Unintended-Harms-LLM
PyTorch · Official

Videos

Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights · Underline

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Artificial Intelligence in Healthcare and Education

Methods: ALIGN